How can I use multi-node OpenMPI through WSL2's NAT?
Using one WSL2 "machine", wsl001, and one real Linux machine, linux002, I noticed that I cannot even run the simple mpirun --host linux hostname test described at https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems. It fails with:
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: wsl001
Remote host: linux002
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
I think the last point is the problem, because ssh linux002 mpirun hostname works fine.
With the --mca plm_base_verbose 10 flag, I noticed this line:
[wsl001:18696] [[11212,0],0] plm:rsh: final template argv:
/usr/sbin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "734789632" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "wsl[3:48],linux[3:1]@0(2)" -mca orte_hnp_uri "734789632.0;tcp://172.17.45.213:42213" --mca plm_base_verbose "10" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "734789632.0;tcp://172.17.45.213:42213" -mca pmix "^s1,s2,cray,isolated"
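It advertises the WSL-internal NAT IP 172.17.45.213 instead of the external IP. So yes, of course, WSL2 networking is going to be a problem here. As the OpenMPI FAQ states, "Open MPI opens random TCP and sometimes random UDP ports between hosts in a single MPI job", so I cannot simply forward specific ports from the Windows host to the WSL machine, and it is not clear to me how an SSH tunnel would help either. Because the internal IP of the WSL machine does not stay constant, I cannot even set up a permanent forward for the SSH port (on top of that, the Windows host blocks port 22 for its own SSHD instance, even when that instance is not in use).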
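To make the observation above concrete, here is a minimal Python sketch that pulls the address mpirun advertises out of the "final template argv" line and checks whether it falls in a private range. The function name hnp_endpoints and the shortened sample line are my own illustration, not anything provided by OpenMPI, and the regular expression only assumes the URI shape seen in the log above ("<id>;tcp://<ip>[,<ip>...]:<port>").

```python
import ipaddress
import re

# Shortened copy of the "final template argv" line from the verbose output above.
SAMPLE_ARGV = (
    '/usr/sbin/ssh <template> orted -mca ess "env" '
    '-mca orte_hnp_uri "734789632.0;tcp://172.17.45.213:42213" '
    '--mca plm_base_verbose "10"'
)

def hnp_endpoints(argv_line):
    """Return (ip, port) pairs advertised in the orte_hnp_uri option.

    Hypothetical helper: it assumes the URI looks like the one in the log,
    with one or more comma-separated IPv4 addresses followed by a port.
    """
    match = re.search(r'orte_hnp_uri "[^";]*;tcp://([0-9.,]+):(\d+)"', argv_line)
    if match is None:
        return []
    port = int(match.group(2))
    return [(ipaddress.ip_address(a), port) for a in match.group(1).split(",")]

if __name__ == "__main__":
    for addr, port in hnp_endpoints(SAMPLE_ARGV):
        # is_private is True for ranges such as 172.16.0.0/12, which remote
        # hosts cannot route to unless something forwards the traffic.
        print(f"{addr}:{port} private={addr.is_private}")
```

For the line above, the only advertised endpoint is 172.17.45.213:42213, which is inside 172.16.0.0/12, so linux002 has no route back to it.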
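Are there any other options for getting a WSL2 machine to work properly in an OpenMPI environment? Would it be enough to make SSH work in the other direction as well, or would the WSL NAT still mess up the port forwarding?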
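For anyone trying to reproduce this, one way to confirm that the callback connection is what fails is to probe the advertised address from linux002 while mpirun is still waiting. The following is only a sketch: can_connect is a hypothetical helper name, and the address and port are the ones from my log, which change on every run.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False

if __name__ == "__main__":
    # Values taken from the orte_hnp_uri in the log above; run this on linux002.
    print(can_connect("172.17.45.213", 42213))
```

If this prints False on linux002 while ssh between the two machines works, the remote daemon has no way to reach mpirun behind the NAT, which matches the last bullet of the error message.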