AWS Batch fails to extract a tar.gz file to /opt/ml/model with OSError: [Errno 30] Read-only file system

How do I resolve AWS Batch failing to extract a tar.gz file to /opt/ml/model with OSError: [Errno 30] Read-only file system?

We have Python code that does the following inside a Docker container:

import boto3
import tarfile


s3 = boto3.client('s3')

s3.download_file("dev-bucket", "test/model.tar.gz", "/opt/ml/model/model.tar.gz")

tar = tarfile.open("/opt/ml/model/model.tar.gz", "r:gz")
tar.extractall(path="/opt/ml/model")
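Since the root cause turned out to be the destination volume running out of space, it can help to fail fast with a clear message instead of a confusing read-only error mid-extraction. Below is a minimal sketch (the helper name `extract_with_space_check` is my own, not from the original code) that compares the archive's uncompressed size against the free space on the destination before extracting:

```python
import shutil
import tarfile


def extract_with_space_check(archive_path, dest_dir):
    """Extract a tar.gz only if dest_dir has enough free space.

    The estimate sums the uncompressed member sizes, which is
    approximate (it ignores filesystem metadata overhead), but it
    surfaces a disk-space problem before extraction starts.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        needed = sum(member.size for member in tar.getmembers())
        free = shutil.disk_usage(dest_dir).free
        if free < needed:
            raise OSError(
                f"not enough space in {dest_dir}: "
                f"need ~{needed} bytes, only {free} free"
            )
        tar.extractall(path=dest_dir)
```

This would not fix the underlying volume-size limit (see the solution below), but it makes the failure mode obvious in the job logs.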

However, the job fails during extraction with "OSError: [Errno 30] Read-only file system". The full stack trace is:

Traceback (most recent call last):
  File "inference.py", line 6
    tar.extractall(path="/opt/ml/model")
  File "/opt/conda/lib/python3.7/tarfile.py", line 2002, in extractall
    numeric_owner=numeric_owner)
  File "/opt/conda/lib/python3.7/tarfile.py", line 2044, in extract
    numeric_owner=numeric_owner)
  File "/opt/conda/lib/python3.7/tarfile.py", line 2114, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/opt/conda/lib/python3.7/tarfile.py", line 2163, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/opt/conda/lib/python3.7/tarfile.py", line 250, in copyfileobj
    dst.write(buf)
OSError: [Errno 30] Read-only file system

The Dockerfile is as follows:

FROM continuumio/miniconda3

# use python3.7
RUN /opt/conda/bin/conda install python=3.7

# Update conda
RUN /opt/conda/bin/conda update -n base conda

# Install build-essential
RUN apt-get update && apt-get install -y build-essential \
    wget \
    nginx \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN conda install -y pandas==0.25.1 scikit-learn==0.21.2 s3fs==0.4.2
RUN pip install pyarrow==1.0.0 mxnet joblib==0.13.2 boto3

CMD [ "/bin/bash" ]

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"


RUN mkdir -p /opt/ml/model
RUN chmod -R +w /opt/ml/model
RUN mkdir -p /opt/ml/input/data
# Set up the program in the image
COPY helloworld /opt/program
WORKDIR /opt/program

Solution

The data was large; the default 10 GB volume filled up, which caused the volume to become read-only. The solution is to use a launch template and attach an additional volume. That fixed the problem for me.

Details:

Please note that by default, Docker allocates 10 gibibytes (GiB) of storage for each volume it creates on an Amazon ECS container instance. If a volume reaches the 10-GiB limit, you can't write any more data to it without either crashing the container instance or having the filesystem switch to read-only mode. This applies only if you're using Amazon Linux 1 AMIs to launch container instances in your ECS cluster. Amazon Linux 2 AMIs use the Docker overlay2 storage driver, which gives you a base storage size equal to the space left on your disk. (By default, Batch launches Amazon Linux 1 AMIs.)
To increase the default storage allocation for Docker volumes, set the dm.basesize storage option to a value higher than 10 GiB in the Docker daemon configuration file /etc/sysconfig/docker on the container instance. The dm.basesize value can be increased up to the size of your EBS volume, which allows the container/batch job to use the full space for execution. After you set dm.basesize, any new images pulled by Docker use the new storage value; any containers or batch jobs that were created or running before you changed the value still use the previous storage value.
To apply the dm.basesize option to all your containers, set the value of the option before the Docker service starts. You can use a launch template to build a configuration that applies to all Amazon Elastic Compute Cloud (Amazon EC2) instances launched by AWS Batch. The following example MIME multi-part file overrides the default Docker storage settings for a compute resource:
==============================
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0
--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"
#cloud-boothook
#!/bin/bash
cloud-init-per once docker_options echo 'OPTIONS="${OPTIONS} --storage-opt dm.basesize=20G"' >> /etc/sysconfig/docker
--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
# Set any ECS agent configuration options
echo ECS_CLUSTER=default >> /etc/ecs/ecs.config
echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config
--==BOUNDARY==--
==============================
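To actually attach the MIME file above to Batch, it has to be registered as an EC2 launch template whose user data is the base64-encoded file, and that template then gets referenced from the Batch compute environment. A rough sketch with the AWS CLI is shown below; the template name `increase-docker-basesize` and the file name `user-data.mime` are placeholders of my own choosing, and the base64 flag may differ on macOS (`base64 -i` instead of `-w0`):

```shell
# Save the MIME multi-part content above as user-data.mime, then:
aws ec2 create-launch-template \
    --launch-template-name increase-docker-basesize \
    --launch-template-data "{\"UserData\":\"$(base64 -w0 user-data.mime)\"}"
```

When creating (or recreating) the Batch compute environment, reference this template in the compute resources' launchTemplate setting; note that Batch only picks up launch-template changes on newly created compute environments, not existing ones.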

