How to get an MPI program to run in a Singularity container on a cluster
I am trying to run an MPI application through a Singularity container on a cluster. I started by testing a simple program, but I have run into trouble.
Here is the test program:
program hello
  include 'mpif.h'
  integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  print *, 'node', rank, ': Hello world'
  call MPI_FINALIZE(ierror)
end
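For reference, outside the container the same program can be compiled and run roughly like this (a minimal sketch, assuming a native Open MPI installation that provides mpif90 and mpirun):
# compile with the MPI Fortran wrapper, then launch 4 ranks locally
mpif90 -o hello hello.f90
mpirun -np 4 ./hello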
I took the following Singularity recipe from https://sylabs.io/guides/3.5/user-guide/mpi.html. I build the sif container locally and then move it to the cluster (the build commands are sketched after the recipe):
Bootstrap: docker
From: ubuntu:latest
%environment
export OMPI_DIR=/opt/ompi
export SINGULARITY_OMPI_DIR=$OMPI_DIR
export SINGULARITYENV_APPEND_PATH=$OMPI_DIR/bin
export SINGULARITYENV_APPEND_LD_LIBRARY_PATH=$OMPI_DIR/lib
%post
echo "Installing required packages..."
apt-get update && apt-get install -y wget git bash gcc gfortran g++ make file
echo "Installing Open MPI"
export OMPI_DIR=/opt/ompi
export OMPI_VERSION=4.0.1
export OMPI_URL="https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-$OMPI_VERSION.tar.bz2"
mkdir -p /tmp/ompi
mkdir -p /opt
# Download
cd /tmp/ompi && wget -O openmpi-$OMPI_VERSION.tar.bz2 $OMPI_URL && tar -xjf openmpi-$OMPI_VERSION.tar.bz2
# Compile and install
cd /tmp/ompi/openmpi-$OMPI_VERSION && ./configure --prefix=$OMPI_DIR && make install
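This is roughly how I build the image locally and copy it to the cluster (a sketch; the definition file name mpi.def and the destination are placeholders, not my exact names):
# build the SIF image from the recipe above (mpi.def is a placeholder name)
sudo singularity build mpicont.sif mpi.def
# copy it to the cluster (user@cluster and the target path are placeholders)
scp mpicont.sif user@cluster:~/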
In my job file I load the cluster's OpenMPI environment matching the version inside the container:
module load OpenMPI/4.0.1-GCC-8.3.0
singularity exec mpicont.sif bash script
mpirun -np 4 singularity exec mpicont.sif ./here/hello
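For completeness, the whole job file looks roughly like this (a sketch assuming a Slurm scheduler; the #SBATCH values are illustrative, not my exact settings):
#!/bin/bash
#SBATCH --ntasks=4        # illustrative resource request
#SBATCH --time=00:10:00   # illustrative time limit

module load OpenMPI/4.0.1-GCC-8.3.0

# first call: compile hello.f90 inside the container using the script shown below
singularity exec mpicont.sif bash script

# second call: launch 4 ranks, each executing the containerized binary
mpirun -np 4 singularity exec mpicont.sif ./here/hello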
The script used in the first singularity call compiles the file inside the container:
# use the Open MPI installation inside the container
export PATH=$OMPI_DIR/bin:$PATH
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH
export MANPATH=$OMPI_DIR/share/man:$MANPATH
# compile the test program with the container's MPI wrapper compiler
mpif90 -o hello hello.f90
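As an optional sanity check (my own addition, assuming ldd is present in the image), the binary can be inspected from inside the container to confirm it resolves the container's Open MPI libraries under /opt/ompi rather than a host installation (path as in the mpirun command above):
singularity exec mpicont.sif ldd ./here/hello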
I find that the first call works fine and produces the executable hello, but the mpirun command fails with the following error:
[anode239:04778] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[anode239:04778] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:
Directory: /scratch
Error: Read-only file system
Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[anode239:04778] [[11585,1],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 107
[anode239:04778] [[11585,0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 346
[anode239:04778] [[11585,0] ORTE_ERROR_LOG: Error in file base/ess_base_std_app.c at line 141
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_session_dir failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[anode239:04778] [[11585,0] ORTE_ERROR_LOG: Error in file ess_pmi_module.c at line 416
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
What is the root cause of this error and how can I fix it?