
Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0 in method wrapper_addmm

How do I fix "Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0" in method wrapper_addmm?

I trained a Faster R-CNN for tool detection. I had already defined my model and everything worked. But to get cleaner code without global variables, I tried to write a MyModel class that builds every object automatically and trains the model. Inside this class I define a dataset attribute, self.dataset = ToolDataset.

In that dataset class I define my input (an image) and my output (a target, which is a dict with bounding boxes, labels, areas, ...). Then I build a data loader (so I have a self.data_loader), and I use the train_one_epoch function from the engine library. Into this function I pass my model (a Faster R-CNN), my data loader and the device, cuda:0 (I printed it). The function iterates over my data loader: it builds a list of images and a list of targets and moves every value of both lists to the right device. It then calls model(images, targets), and that is the step where the two-device error is raised (I pasted the error at the end of this message).
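For context, my target dicts follow the usual torchvision detection format. Here is a minimal sketch of one (image, target) pair and of the move to the device; the values are made up, only the keys and dtypes matter:

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# One training image: a float tensor in [0, 1], shape [C, H, W].
image = torch.rand(3, 600, 800)

# One target dict in the format torchvision detection models expect:
# "boxes" as float [N, 4] in (x1, y1, x2, y2) and "labels" as int64 [N];
# "area" and "iscrowd" are used by the COCO-style evaluation utilities.
target = {
    "boxes": torch.tensor([[100.0, 50.0, 400.0, 300.0]], dtype=torch.float32),
    "labels": torch.tensor([1], dtype=torch.int64),
    "area": torch.tensor([75000.0]),
    "iscrowd": torch.tensor([0], dtype=torch.int64),
}

# The same move that train_one_epoch performs on the whole batch.
image = image.to(device)
target = {k: v.to(device) for k, v in target.items()}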

Even though every tensor (each image and every value of my target dicts) returns True for tensor.is_cuda, I still get the error. So I really don't understand why the error says there is also a cpu device. Below (after a short device-check sketch) I show my train and train_one_epoch functions, together with my images and targets variables:
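To be explicit about what I checked: my is_cuda prints only cover the inputs. A complementary check would be to look at the device of every parameter and buffer of the model itself, since a single submodule left on the CPU produces the same error. A minimal diagnostic sketch, assuming the model variable built in my train method below:

# Diagnostic sketch: list every model-side tensor that did not end up on the GPU.
# `model` is assumed to be the Faster R-CNN instance built in train() below.
def report_cpu_tensors(model):
    for name, p in model.named_parameters():
        if not p.is_cuda:
            print("parameter on CPU:", name, tuple(p.shape))
    for name, b in model.named_buffers():
        if not b.is_cuda:
            print("buffer on CPU:", name, tuple(b.shape))

report_cpu_tensors(model)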

The train method

def train(self, num_epoch=10, gpu=True):

        if gpu:
            CUDA_LAUNCH_BLOCKING = "1"

            #torch.set_default_tensor_type(torch.FloatTensor)
            model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
            use_cuda = torch.cuda.is_available()
            device = torch.device("cuda:0" if use_cuda else "cpu")
            model.to(device)
            if self.multi_object_detection == False:
                num_classes = 2  # ['Tool', 'background']
            else:
                print("need to set a multi object detection code")

            in_features = torch.tensor(model.roi_heads.box_predictor.cls_score.in_features, dtype=torch.int64).to(device)
            print("in_features = {}".format(in_features))
            model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
            print("model.roi_heads.box_predictor {}".format(model.roi_heads.box_predictor))

            model_parameters = filter(lambda p: p.requires_grad, model.parameters())
            #params = sum([np.prod(p.size()) for p in model_parameters])
            params = [p for p in model.parameters() if p.requires_grad]

            optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)
            lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
            gc.collect()
            num_epochs = 5
            FILE_model_dict_gpu = "model_state_dict__gpu_lab2_and_lab7_5epoch.pth"
            list_of_list_losses = []
            print("device = ", device)

            if (self.data_loader.dataset) == None:
                self.build_DataLoader(device)

            for epoch in tqdm(range(num_epochs)):

                # Train for one epoch, printing every 10 iterations
                train_his_, list_losses, list_losses_dict = train_one_epoch(model, optimizer, self.data_loader, device, epoch, print_freq=10)
                list_of_list_losses.append(list_losses)
                # Compute losses over the validation set
                #val_his_ = validate_one_epoch(model, val_data_loader, print_freq=10)

                # Update the learning rate
                print("lr before update : ", lr_scheduler)
                lr_scheduler.step()
                print("lr after update : ", lr_scheduler)
                # Store loss values to plot learning curves afterwards.
                if epoch == 0:
                    train_history = {k: [v] for k, v in train_his_.items()}
                    #val_history = {k: [v] for k, v in val_his_.items()}
                else:
                    for k, v in train_his_.items(): train_history[k] += [v]
                #   for k, v in val_his_.items(): val_history[k] += [v]

                # The model could also be saved inside the loop with a criterion, e.g. when the validation loss decreases:
                # torch.save(model, save_path)

                torch.cuda.empty_cache()
                gc.collect()

The train_one_epoch function (I print some information there; the output is shown at the end of the message):


def train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq):

    model.train()
    metric_logger = utilss.MetricLogger(delimiter="  ")
    metric_logger.add_meter('lr', utilss.SmoothedValue(window_size=1, fmt='{value:.6f}'))
    header = 'Epoch: [{}]'.format(epoch)
    list_losses = []
    list_losses_dict = []
    for i, values in tqdm(enumerate(metric_logger.log_every(data_loader, print_freq, header))):
        images, targets = values
        for image in images:
            print("before the to(device) operation, image.is_cuda = {}".format(image.is_cuda))
        images = list(image.to(device, dtype=torch.float) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        #images = [image.cuda() for image in images]
        for image in images:
            print(image)
            print("after the to(device) operation, image.is_cuda = {}".format(image.is_cuda))
        for target in targets:
            for t, dict_value in target.items():
                print("after the to(device) operation, dict_value.is_cuda = {}".format(dict_value.is_cuda))

        print("images = {}".format(images))
        print("targets = {}".format(targets))

        # Feed the training samples to the model and compute the losses
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        # Reduce losses over all GPUs for logging purposes
        loss_dict_reduced = utilss.reduce_dict(loss_dict)
        losses_reduced = sum(loss for loss in loss_dict_reduced.values())
        loss_value = losses_reduced.item()
        print("Loss is {}".format(loss_value))
        if not math.isfinite(loss_value):
            print("Loss is {}, stopping training".format(loss_value))
            print(loss_dict_reduced)
            sys.exit(1)
        list_losses.append(loss_value)

        # Reset the gradients before the backward pass
        optimizer.zero_grad()
        # Compute gradients via backpropagation
        losses.backward()
        # Update the parameters with the current gradients
        optimizer.step()

Here is my output (including my images and targets, followed by the error):

in_features = 1024
model.roi_heads.box_predictor FastRCNNPredictor(
  (cls_score): Linear(in_features=1024, out_features=2, bias=True)
  (bbox_pred): Linear(in_features=1024, out_features=8, bias=True)
)
device =  cuda:0

100%|██████████| 515/515 [00:00<00:00,112118.06it/s]
100%|██████████| 761/761 [00:00<00:00,111005.96it/s]
  0%|          | 0/5 [00:00<?,?it/s]
0it [00:00,?it/s]

before the to(device) operation,image.is_cuda = True
tensor([[[0.0078,0.0078,...,0.0000,0.0000],[0.0078,0.0118,0.0118],[0.0235,0.0235,0.0235],[0.0353,0.0353,0.0314,0.0314]],[[0.0078,0.0039,0.0039],0.0157,0.0157],0.0235]],0.0078],0.0196,0.0196],0.0275,0.0275]]],device='cuda:0')
after the to(device) operation,image.is_cuda = True
after the to(device) operation,dict_value.is_cuda = True
after the to(device) operation,dict_value.is_cuda = True
images = [tensor([[[0.0078,device='cuda:0')]
targets = [{'Boxes': tensor([[1118.8964,1368.9186,399.3243],[1043.0958,111.4863,1332.4319,426.1295]],device='cuda:0',dtype=torch.float64),'labels': tensor([1,1],device='cuda:0'),'index': tensor([311],'area': tensor([99839.9404,91037.6485],'iscrowd': tensor([0],device='cuda:0')}]

/home/nathaneberrebi/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448278899/work/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input,kernel_size,stride,padding,dilation,ceil_mode)
0it [00:02,?it/s]
  0%|          | 0/5 [00:02<?,?it/s]

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-51a35da5b1fe> in <module>
----> 1 class_model.train()

<ipython-input-7-d44d099a7743> in train(self, num_epoch, gpu)
    144 
    145                 # Train for one epoch, printing every 10 iterations
--> 146                 train_his_, list_losses, list_losses_dict = train_one_epoch(model, optimizer, self.data_loader, device, epoch, print_freq=10)
    147                 list_of_list_losses.append(list_losses)
    148                 # Compute losses over the validation set

<ipython-input-6-347c12a81a2f> in train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq)
    519 
    520         # Feed the training samples to the model and compute the losses
--> 521         loss_dict = model(images,targets)
    522         losses = sum(loss for loss in loss_dict.values())
    523 

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self,*input,**kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input,**kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks,non_full_backward_hooks = [],[]

~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/generalized_rcnn.py in forward(self,images,targets)
     95             features = OrderedDict([('0',features)])
     96         proposals,proposal_losses = self.rpn(images,features,targets)
---> 97         detections,detector_losses = self.roi_heads(features,proposals,images.image_sizes,targets)
     98         detections = self.transform.postprocess(detections,original_image_sizes)
     99 

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)

~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/roi_heads.py in forward(self, features, proposals, image_shapes, targets)
    752         box_features = self.box_roi_pool(features, proposals, image_shapes)
    753         box_features = self.box_head(box_features)
--> 754         class_logits, box_regression = self.box_predictor(box_features)
    755 
    756         result: List[Dict[str,torch.Tensor]] = []

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)

~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/faster_rcnn.py in forward(self, x)
    280             assert list(x.shape[2:]) == [1, 1]
    281         x = x.flatten(start_dim=1)
--> 282         scores = self.cls_score(x)
    283         bbox_deltas = self.bbox_pred(x)
    284 

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/linear.py in forward(self,input)
     94 
     95     def forward(self,input: Tensor) -> Tensor:
---> 96         return F.linear(input,self.weight,self.bias)
     97 
     98     def extra_repr(self) -> str:

~/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py in linear(input, weight, bias)
   1845     if has_torch_function_variadic(input, weight):
   1846         return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1847     return torch._C._nn.linear(input, weight, bias)
   1848 
   1849 

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument mat1 in method wrapper_addmm)
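For reference, mat1 in wrapper_addmm is one of the matrices passed to the underlying addmm call made by F.linear, so the exact same message can be reproduced with nothing more than a Linear layer left on the CPU receiving a CUDA input. A minimal sketch, independent of my actual model:

import torch
import torch.nn as nn

# A Linear layer whose weight and bias stay on the CPU ...
fc = nn.Linear(1024, 2)

# ... fed with a CUDA input raises the same RuntimeError
# (this sketch obviously needs a CUDA device to run).
x = torch.randn(8, 1024, device="cuda:0")
out = fc(x)  # RuntimeError: Expected all tensors to be on the same device ...

Since the traceback ends inside self.cls_score of the box predictor, this makes me suspect a CPU-side tensor somewhere in that head rather than in the images or targets, but I have not been able to confirm it.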

Thank you very much for your help; I have been stuck on this problem for a while. Because of the same error I also cannot torch.jit.trace my latest model (the one from before I tried to clean up my code with a class that builds every object automatically from a single sequence of function calls). I need to fix this in order to use the model from C++ code. Let me know if you need more information.

Here is my PyTorch environment:

PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.8.0-59-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 3060 Laptop GPU
Nvidia driver version: 460.80
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] numpydoc==1.1.0
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0a0+33b2469
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] mkl                       2021.2.0           h06a4308_296  
[conda] mkl-service               2.4.0            py38h497a2fe_0    conda-forge
[conda] mkl_fft                   1.3.0            py38h42c9631_2  
[conda] mkl_random                1.2.2            py38h1abd341_0    conda-forge
[conda] numpy                     1.18.5                   pypi_0    pypi
[conda] numpy-base                1.20.2           py38hfae3a4d_0  
[conda] numpydoc                  1.1.0                      py_1    conda-forge
[conda] pytorch                   1.9.0           py3.8_cuda11.1_cudnn8.0.5_0    pytorch
[conda] torch                     1.9.0                    pypi_0    pypi
[conda] torchaudio                0.9.0                      py38    pytorch
[conda] torchvision               0.10.0               py38_cu111    pytorch

