How to resolve the Slurm MPI error: ORTE daemon failed
I'm running into problems with Slurm and OpenMPI on a cluster. Whenever I run any job that uses mpirun, I get the following error:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
The problem appeared suddenly and seems to affect every compute node.
Seemingly related, srun now fails as well, with the following messages:
srun: error: Task launch for <jobid> failed on node <nodename>: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Any help is appreciated!
Edit: adding an example
If I run mpirun hostname on the head node, everything works. Inside a Slurm allocation (salloc), however, running mpirun hostname produces the error.
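One thing that might be worth ruling out: the srun message "Job credential expired" is commonly reported when the clocks on the head node and compute nodes have drifted apart. Whether that is the cause here is only an assumption; nothing in the output above confirms it. Below is a minimal sketch for comparing timestamps once they have been collected, for example by running date +%s on each node over ssh. The function name clock_skew and the "<node> <epoch-seconds>" input format are hypothetical, chosen just for this illustration.

```python
def clock_skew(samples: str) -> dict[str, int]:
    """Return each node's clock offset in seconds from the earliest clock.

    `samples` is text with one "<node> <epoch-seconds>" pair per line
    (a hypothetical format for this sketch). Lines that do not match
    that shape are skipped.
    """
    times: dict[str, int] = {}
    for line in samples.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # not a "<node> <epoch>" line
        node, stamp = parts
        try:
            times[node] = int(stamp)
        except ValueError:
            continue  # timestamp was not an integer
    if not times:
        return {}
    reference = min(times.values())
    # Offset of every node relative to the earliest reported time.
    return {node: stamp - reference for node, stamp in times.items()}
```

For example, passing "head 1000\nnode01 1000\nnode02 1450" reports node02 as 450 seconds ahead of the other two. Timestamps gathered one node at a time will differ by a second or two anyway, so only offsets on the order of minutes would be suspicious.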