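Bash, Conda, Docker, and Ray: what start commands should I give Ray so that the bash profile in the Docker container is sourced correctly at runtime?
I am trying to launch jobs programmatically on EC2 using Ray and Docker. I want to use conda for package management inside my Docker container. I have worked out how to build the container, and if I run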
docker run -i -t my_container:my_tag /bin/bash
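I can start jobs inside the container locally. The problem is that when I add Ray to the picture to launch jobs remotely, Ray fails with the following error: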
start: ray: command not found
Cluster: my-cluster
Checking AWS environment settings
AWS config
IAM Profile: ray-head-v1
EC2 Key pair (head & workers): [redacted]
VPC subnets (head & workers): [redacted]
EC2 Security groups (head & workers): [redacted]
EC2 AMI (head & workers): [redacted]
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic,due to --yes]
Acquiring an up-to-date head node
Launched 1 nodes [subnet_id=[redacted]]
Launched instance i-067e250cc8591da86 [state=pending,info=pending]
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/6] Waiting for SSH to become available
Running `uptime` as a test.
Waiting for IP
Not yet available,retrying in 10 seconds
Not yet available,retrying in 10 seconds
Received: 3.21.104.163
SSH still not available SSH command Failed.,retrying in 5 seconds.
SSH still not available SSH command Failed.,retrying in 5 seconds.
Success.
Updating cluster configuration. [hash=1e011279ffec6f94b2bff4ebf536e6966be5c79a]
New status: syncing-files
[3/6] Processing file mounts
[4/6] No worker file mounts to sync
New status: setting-up
[3/6] No initialization commands to run.
[4/6] No setup commands to run.
[6/6] Starting the Ray runtime
New status: update-Failed
!!!
SSH command Failed.
!!!
Failed to setup head node.
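At this point I have reached the limit of my understanding of how Ray and Docker interact. I think the problem is that the head_start_ray_commands are somehow being passed to docker run. Since Docker uses the sh shell to run commands, the bash profile is never sourced, so programs such as conda and ray cannot be found. That would explain why the container works without any problems when I start a bash shell interactively in a local instance of it. I have tried adding /bin/bash --login to the beginning of the head_start_ray_commands, but that only seems to make the whole program hang.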
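The hypothesis above can be checked without Ray or Docker. The following is a minimal sketch, not a description of what Ray actually does: it assumes the environment is put on PATH from ~/.bash_profile, and it uses a throwaway HOME directory and a made-up executable named fake_ray_cli (both hypothetical) in place of the real container and the real ray binary.

```shell
#!/usr/bin/env bash
# Stand-in for the container: a temporary HOME whose .bash_profile puts a
# fake executable on PATH. All names here are hypothetical.
demo_home="$(mktemp -d)"
mkdir -p "$demo_home/fakeenv/bin"
printf '#!/bin/sh\necho "fake ray"\n' > "$demo_home/fakeenv/bin/fake_ray_cli"
chmod +x "$demo_home/fakeenv/bin/fake_ray_cli"
echo 'export PATH="$HOME/fakeenv/bin:$PATH"' > "$demo_home/.bash_profile"

# Non-login, non-interactive shell: ~/.bash_profile is never read.
if HOME="$demo_home" sh -c 'command -v fake_ray_cli' >/dev/null 2>&1; then
  plain_result="found"
else
  plain_result="not found"
fi

# Login shell with -c: reads ~/.bash_profile, runs the one command, then exits.
if HOME="$demo_home" bash --login -c 'command -v fake_ray_cli' >/dev/null 2>&1; then
  login_result="found"
else
  login_result="not found"
fi

echo "sh -c:           fake_ray_cli $plain_result"
echo "bash --login -c: fake_ray_cli $login_result"
rm -rf "$demo_home"
```

If the hypothesis holds, an entry such as ray stop could be written as bash --login -c 'ray stop'. A bare /bin/bash --login with no -c starts a shell that waits for commands on standard input, which may be why the earlier attempt appeared to hang.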
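What is the proper way to get Ray to source the bash profile before executing commands? If that is not possible, is there a better way to do this? For reference, here is my current Ray config: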
init:
  address: null
  remote: {}
cluster:
  cluster_name: my-cluster
  min_workers: 0
  max_workers: 2
  initial_workers: 0
  autoscaling_mode: default
  target_utilization_fraction: 0.8
  idle_timeout_minutes: 5
  docker:
    image: [redacted]
    container_name: 'my-container'
    pull_before_run: true
    run_options: ["--gpus 'all'"]
  provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2a,us-east-2b
    cache_stopped_nodes: false
    key_pair:
      key_name: [redacted]
  auth:
    ssh_user: ubuntu
  head_node:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  worker_nodes:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  file_mounts: {}
  initialization_commands: []
  setup_commands: []
  head_setup_commands: []
  worker_setup_commands: []
  head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
  worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
Edit
The simplest workaround seems to be to avoid conda entirely in favor of venv.
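One likely reason venv sidesteps the problem is that its executables can be called by absolute path, with no activation step and no profile to source. Here is a rough sketch of that property, assuming python3 with the venv module is available; the directory name is hypothetical, and --without-pip is used only to keep the demo offline.

```shell
#!/usr/bin/env bash
# Create a throwaway virtual environment (hypothetical location).
venv_dir="$(mktemp -d)/demo-venv"
python3 -m venv --without-pip "$venv_dir"

# Call the venv's interpreter by path from plain sh: no `activate`, no profile.
# sys.prefix differs from sys.base_prefix only when running inside a venv.
in_venv="$(sh -c '"$1" -c "import sys; print(sys.prefix != sys.base_prefix)"' _ "$venv_dir/bin/python")"

echo "running inside the venv without activation: $in_venv"
rm -rf "$(dirname "$venv_dir")"
```

Packages installed into a venv place their console scripts in the same bin directory, so a start command could point at the full path of the ray executable there instead of relying on PATH.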