How to fix the exception "cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED" when training the ONNX pretrained model Emotion FerPlus
I am testing training of the Emotion FerPlus emotion recognition model. Training fails with a cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED error.
I am using an Nvidia Titan RTX GPU with 24 GB of memory. I reduced minibatch_size from 32 to 1, but the error persists. I am running the CNTK GPU Docker image.
The full error message is:
About to throw exception 'cuDNN failure 8: CUDNN_STATUS_EXECUTION_Failed ; GPU=0 ; hostname=d9150da5d531 ; expr=cudnnConvolutionForward(*m_cudnn,&C::One,m_inT,ptr(in),*m_kernelT,ptr(kernel),*m_conv,m_fwdAlgo.selectedAlgo,ptr(workspace),workspace.BufferSize(),&C::Zero,m_outT,ptr(out))'
cuDNN failure 8: CUDNN_STATUS_EXECUTION_Failed ; GPU=0 ; hostname=d9150da5d531 ; expr=cudnnConvolutionForward(*m_cudnn,ptr(out))
Traceback (most recent call last):
File "train.py",line 193,in <module>
main(args.base_folder,args.training_mode)
File "train.py",line 124,in main
trainer.train_minibatch({input_var : images,label_var : labels})
File "/root/anaconda3/envs/cntk-py35/lib/python3.5/site-packages/cntk/train/trainer.py",line 184,in train_minibatch
device)
File "/root/anaconda3/envs/cntk-py35/lib/python3.5/site-packages/cntk/cntk_py.py",line 3065,in train_minibatch
return _cntk_py.Trainer_train_minibatch(self,*args)
RuntimeError: cuDNN failure 8: CUDNN_STATUS_EXECUTION_Failed ; GPU=0 ; hostname=d9150da5d531 ; expr=cudnnConvolutionForward(*m_cudnn,ptr(out))
[CALL STACK]
[0x7fc04da7ce89] + 0x732e89
[0x7fc045a71aaf] + 0xeabaaf
[0x7fc045a7b613] Microsoft::MSR::cntk::CuDnnConvolutionEngine<float>:: ForwardCore (Microsoft::MSR::cntk::Matrix<float> const&,Microsoft::MSR::cntk::Matrix<float> const&,Microsoft::MSR::cntk::Matrix<float>&,Microsoft::MSR::cntk::Matrix<float>&) + 0x1a3
[0x7fc04dd4f8d3] Microsoft::MSR::cntk::ConvolutionNode<float>:: ForwardProp (Microsoft::MSR::cntk::FrameRange const&) + 0xa3
[0x7fc04dfba654] Microsoft::MSR::cntk::computationNetwork::PARTraversalFlowControlNode:: ForwardProp (std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase> const&,Microsoft::MSR::cntk::FrameRange const&) + 0xf4
[0x7fc04dcb6e33] std::_Function_handler<void (std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase> const&),void Microsoft::MSR::cntk::computationNetwork::ForwardProp<std::vector<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>>>>(std::vector<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>>> const&)::{lambda(std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase> const&)#1}>:: _M_invoke (std::_Any_data const&,std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase> const&) + 0x63
[0x7fc04dd04ed9] void Microsoft::MSR::cntk::computationNetwork:: TravserseInSortedGlobalEvalOrder <std::vector<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>>> const&,std::function<void (std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase> const&)> const&) + 0x5b9
[0x7fc04dca64da] cntk::CompositeFunction:: Forward (std::unordered_map<cntk::Variable,std::shared_ptr<cntk::Value>,std::hash<cntk::Variable>,std::equal_to<cntk::Variable>,std::allocator<std::pair<cntk::Variable const,std::shared_ptr<cntk::Value>>>> const&,std::unordered_map<cntk::Variable,std::shared_ptr<cntk::Value>>>>&,cntk::DeviceDescriptor const&,std::unordered_set<cntk::Variable,std::allocator<cntk::Variable>> const&,std::allocator<cntk::Variable>> const&) + 0x15da
[0x7fc04dc3d603] cntk::Function:: Forward (std::unordered_map<cntk::Variable,std::allocator<cntk::Variable>> const&) + 0x93
[0x7fc04ddbf91b] cntk::Trainer:: ExecuteForwardBackward (std::unordered_map<cntk::Variable,std::shared_ptr<cntk::Value>>>>&) + 0x36b
[0x7fc04ddc06e4] cntk::Trainer:: TrainLocalMinibatch (std::unordered_map<cntk::Variable,bool,cntk::DeviceDescriptor const&) + 0x94
[0x7fc04ddc178a] cntk::Trainer:: TrainMinibatch (std::unordered_map<cntk::Variable,cntk::DeviceDescriptor const&) + 0x5a
[0x7fc04ddc1852] cntk::Trainer:: TrainMinibatch (std::unordered_map<cntk::Variable,cntk::DeviceDescriptor const&) + 0x52
[0x7fc04eb2db22] + 0x229b22
[0x7fc057ea15e9] PyCFunction_Call + 0xf9
[0x7fc057f267c0] PyEval_EvalFrameEx + 0x6ba0
[0x7fc057f29b49] + 0x144b49
[0x7fc057f28df5] PyEval_EvalFrameEx + 0x91d5
[0x7fc057f29b49] + 0x144b49
[0x7fc057f28df5] PyEval_EvalFrameEx + 0x91d5
[0x7fc057f29b49] + 0x144b49
[0x7fc057f28df5] PyEval_EvalFrameEx + 0x91d5
[0x7fc057f29b49] + 0x144b49
[0x7fc057f29cd8] PyEval_EvalCodeEx + 0x48
[0x7fc057f29d1b] PyEval_EvalCode + 0x3b
[0x7fc057f4f020] PyRun_FileExFlags + 0x130
[0x7fc057f50623] PyRun_SimpleFileExFlags + 0x173
[0x7fc057f6b8c7] Py_Main + 0xca7
[0x400add] main + 0x15d
[0x7fc056f06830] __libc_start_main + 0xf0
[0x4008b9]
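One configuration detail worth ruling out here (an editorial note, not part of the original question): the Titan RTX is a Turing-generation GPU (compute capability 7.5), and Turing requires CUDA 10.0 or later. CNTK Docker images older than release 2.7 shipped CUDA 9.x, and running a Turing card against a cuDNN built for an older CUDA toolkit can surface as exactly this kind of CUDNN_STATUS_EXECUTION_FAILED at the first convolution. A small sketch of the version check (the capability table below is an illustrative subset, not an authoritative list):

```python
# Sketch (assumptions): compute capability 7.5 (Turing, e.g. Titan RTX)
# needs CUDA >= 10.0. The table maps a few GPU generations to the
# minimum CUDA toolkit version that can target them.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 1): (8, 0),   # Pascal  (e.g. GTX 1080)
    (7, 0): (9, 0),   # Volta   (e.g. V100)
    (7, 5): (10, 0),  # Turing  (e.g. Titan RTX)
}

def cuda_supports_gpu(cuda_version, compute_capability):
    """Return True if this CUDA toolkit version can target the GPU."""
    required = MIN_CUDA_FOR_CAPABILITY.get(compute_capability)
    if required is None:
        raise ValueError("unknown compute capability: %r" % (compute_capability,))
    return cuda_version >= required

# Titan RTX (7.5) inside an old CNTK image built on CUDA 9.0:
print(cuda_supports_gpu((9, 0), (7, 5)))   # False -> expect cuDNN failures
# Same GPU under a CUDA 10.0 image (CNTK 2.7):
print(cuda_supports_gpu((10, 0), (7, 5)))  # True
```

If the container reports a CUDA version below 10.0 (check /usr/local/cuda/version.txt inside the image), switching to the CNTK 2.7 GPU image, or to a framework built against CUDA 10+, is worth trying before anything else.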
Solution
CNTK is now in maintenance mode (essentially deprecated). While CNTK exports to ONNX well, importing ONNX models is not well supported.
ONNX Runtime (https://github.com/microsoft/onnxruntime) now supports training, so give it a try. ONNX Runtime training is under active development and supported, so if something does not work, it is likely to be fixed quickly.