How to fix an error when training on multiple GPUs with PyTorch Lightning
The following code works on a single GPU, but raises an error when using multiple GPUs: RuntimeError: grad can be implicitly created only for scalar outputs
Code
def forward(
    self,
    input_ids,
    attention_mask=None,
    decoder_input_ids=None,
    decoder_attention_mask=None,
    lm_labels=None,
):
    return self.model(
        input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        labels=lm_labels,
    )

def _step(self, batch):
    lm_labels = batch["target_ids"]
    # lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100
    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch["target_mask"],
    )
    loss = outputs[0]
    return loss

def training_step(self, batch, batch_idx):
    loss = self._step(batch)
    return {"loss": loss}
The loss value is a scalar: tensor(12.8875, device='cuda:1', grad_fn=<NllLossBackward>). What could be the reason behind this error?
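The likely cause: with a DataParallel-style multi-GPU strategy ("dp"), Lightning gathers one loss per GPU, so the value that reaches backward() is a 1-D tensor rather than a scalar, and autograd refuses to create the implicit gradient. A minimal standalone sketch of the failure mode (the loss values below are illustrative, not from the actual run):

```python
import torch

# Simulate what DataParallel-style gathering produces: one loss per GPU,
# stacked into a 1-D tensor instead of a single scalar.
per_gpu_loss = torch.tensor([12.8875, 13.0121], requires_grad=True)

try:
    per_gpu_loss.backward()  # non-scalar: autograd cannot create the implicit grad
except RuntimeError as err:
    print(err)  # grad can be implicitly created only for scalar outputs

# Reducing to a scalar first restores the single-GPU behaviour.
per_gpu_loss.sum().backward()
print(per_gpu_loss.grad)  # tensor([1., 1.])
```

The same backward() call that works on one GPU fails on several, because only the gathered shape changed, not the code.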
Traceback (most recent call last):
  File "training_trial.py", line 390
    trainer.fit(model)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
    return self.train_or_test()
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 549, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 490, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1296, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 286, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 144, in __optimizer_step
    optimizer.step(closure=closure, *args, **kwargs)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/transformers/optimization.py", line 318, in step
    loss = closure()
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 699, in train_step_and_backward_closure
    self.trainer.hiddens
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 802, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 829, in backward
    result.closure_loss, optimizer, opt_idx, *args, **kwargs
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 109, in backward
    model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1162, in backward
    loss.backward(*args, **kwargs)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 126, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 50, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
Solution
Add a training_step_end(). Reference: https://github.com/PyTorchLightning/pytorch-lightning/issues/4073
def training_step_end(self, training_step_outputs):
    return {'loss': training_step_outputs['loss'].sum()}
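A variant worth considering: reducing with mean() instead of sum() keeps the loss on the same scale as a single-GPU run, since sum() grows with the number of GPUs. The sketch below simulates the gathered dict Lightning would pass in under "dp" (the two-GPU values are purely illustrative):

```python
import torch

def training_step_end(training_step_outputs):
    # In "dp" mode the gathered "loss" is a 1-D tensor with one entry per
    # GPU; mean() reduces it to a scalar while keeping the value on the
    # same scale as single-GPU training.
    return {"loss": training_step_outputs["loss"].mean()}

# Hypothetical gathered output from two GPUs, for illustration only.
gathered = {"loss": torch.tensor([12.0, 14.0], requires_grad=True)}
result = training_step_end(gathered)
print(result["loss"])  # 0-dim scalar tensor: tensor(13., grad_fn=<MeanBackward0>)
result["loss"].backward()  # now succeeds, since the loss is a scalar
```

Either reduction fixes the RuntimeError; mean() simply makes the logged loss comparable across GPU counts.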