
Slurm MPI error: ORTE daemon failed

I'm having problems with Slurm and Open MPI on a cluster. Whenever I run any job that uses mpirun, I get the following error:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

The problem appeared suddenly, and it seems to affect all of the compute nodes.

Possibly related: srun now fails as well, with the following message:

srun: error: Task launch for <jobid> failed on node <nodename>: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Thanks for any help!

Edit: adding an example

If I run mpirun hostname on the head node, everything works. However, inside a Slurm allocation (salloc), running mpirun hostname produces the error.
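The two cases described above can be written out as a small reproduction script. This is only a sketch: the function name run_repro and the two-node request (-N 2) are illustrative assumptions, not details from the original setup, and the script skips itself if mpirun or salloc is not on PATH.

```shell
#!/bin/sh
# Reproduction sketch for the two cases described in the post.
# Assumptions (not from the original post): the function name run_repro
# and the node count -N 2 are chosen purely for illustration.

run_repro() {
    # Bail out cleanly if the required tools are not installed here.
    for tool in mpirun salloc; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "skip: $tool not found"
            return 0
        fi
    done

    # Case 1: directly on the head node (reported to work).
    mpirun hostname

    # Case 2: inside a Slurm allocation (reported to fail with the
    # ORTE daemon error).
    salloc -N 2 mpirun hostname
}
```

Calling run_repro on the head node should print the hostname for the first case, and, on the affected cluster, the ORTE daemon error for the second.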
