Bat*_*men 3 amazon-web-services amazon-sagemaker
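An AWS SageMaker notebook instance has a fixed root volume of about 104 GB, of which only about 15 GB is free.
Docker uses this ephemeral storage (/var/lib/docker, as far as I know).
When I try to build a Docker image for a custom training job, the ephemeral root volume fills up and the system throws a "No space left on device" error.
I tried deleting the anaconda directory (~62 GB), but then the boto3 and sagemaker Python libraries stopped working.
What is the best way to solve this?
I am trying to build a heavy Dockerfile to push to ECR: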
ARG REGION="us-east-1"
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
RUN pip3 install torch==1.8.2+cu111 torchvision==0.9.2+cu111 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
RUN python3 -m pip install detectron2 -f \
https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.8/index.html
ENV FORCE_CUDA="1"
ENV TORCH_CUDA_ARCH_LIST="Volta"
ENV FVCORE_CACHE="/tmp"
############# SageMaker section ##############
COPY tested_train_src/train_src /opt/ml/code
WORKDIR /opt/ml/code
ENV SAGEMAKER_SUBMIT_DIRECTORY="/opt/ml/code"
ENV SAGEMAKER_PROGRAM="train.py"
WORKDIR /
ENTRYPOINT ["bash", "-m", "start_with_right_hostname.sh"]
Build command:
docker build -t image-name:tag . --build-arg REGION="us-east-1"
Output of docker build:
Sending build context to Docker daemon 1.935GB
Step 1/12 : ARG REGION="us-east-1"
Step 2/12 : FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
1.8.1-gpu-py36-cu111-ubuntu18.04: Pulling from pytorch-training
d2c87b75: Pulling fs layer
10be24e1: Pulling fs layer
7173dcfe: Pulling fs layer
8de7822d: Pulling fs layer
bf66c36b: Pulling fs layer
c74d4d18: Pulling fs layer
f70a70b2: Pulling fs layer
4e2cb041: Pulling fs layer
8ddd4da6: Pulling fs layer
fac38f0d: Pulling fs layer
a26fd875: Pulling fs layer
1dca51bb: Pulling fs layer
0d6bb6c9: Pulling fs layer
26721764: Pulling fs layer
956fbe7a: Pulling fs layer
ad4fa2a5: Pulling fs layer
20c0bd9a: Pulling fs layer
82804870: Pulling fs layer
1d1fdc54: Pulling fs layer
4500c676: Pulling fs layer
923bbc02: Pulling fs layer
0c9d88c6: Pulling fs layer
f5b0d167: Pulling fs layer
2f2aa1af: Pulling fs layer
c272e0bb: Pulling fs layer
311661aa: Pulling fs layer
ed3ef379: Pulling fs layer
03c2d7ac: Pulling fs layer
1cefc5dc: Pulling fs layer
30fd2377: Pulling fs layer
78d30971: Pulling fs layer
d18f41de: Pulling fs layer
4c2aeed5: Pulling fs layer
f099a687: Pulling fs layer
253573ff: Pulling fs layer
515cab8b: Pulling fs layer
056b70c3: Pulling fs layer
Digest: sha256:66af111d2bd9dae500ad73a7b427103fe8379cbb24bf4ce7cb7d5770d31cd932
Status: Downloaded newer image for 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
---> b4191cf0b8c9
Step 3/12 : RUN pip3 install torch==1.8.2+cu111 torchvision==0.9.2+cu111 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
---> Running in 7c62740a69c6
Looking in links: https://download.pytorch.org/whl/lts/1.8/torch_lts.html
Collecting torch==1.8.2+cu111
Downloading https://download.pytorch.org/whl/lts/1.8/cu111/torch-1.8.2%2Bcu111-cp36-cp36m-linux_x86_64.whl (1982.2 MB)
ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device
Disk usage before the build:
sh-4.2$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 1.9G 76K 1.9G 1% /dev
tmpfs 1.9G 0 1.9G 0% /dev/shm
/dev/nvme0n1p1 104G 89G 16G 86% /
/dev/nvme1n1 63G 1.9G 58G 4% /home/ec2-user/SageMaker
Disk usage after the failed build:
sh-4.2$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 1.9G 76K 1.9G 1% /dev
tmpfs 1.9G 0 1.9G 0% /dev/shm
/dev/nvme0n1p1 104G 101G 2.4G 98% /
/dev/nvme1n1 63G 1.9G 58G 4% /home/ec2-user/SageMaker
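The df output shows that the root volume fills up during the build while the volume mounted at /home/ec2-user/SageMaker stays almost empty. A minimal sketch of a pre-build check follows, assuming Docker stores its data under /var/lib/docker. The helper names avail_kb and has_free_gib and the 20 GiB threshold are illustrative and not from the original post.

```shell
#!/bin/sh
# Print the free space, in KiB, of the filesystem that holds the given path.
avail_kb() {
    # -P forces the portable one-line-per-filesystem format,
    # -k reports sizes in 1024-byte blocks; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if the filesystem holding path $1 has at least $2 GiB free.
has_free_gib() {
    [ "$(avail_kb "$1")" -ge $(( $2 * 1024 * 1024 )) ]
}

# Assumed Docker data directory; fall back to / if it does not exist.
docker_dir="/var/lib/docker"
[ -d "$docker_dir" ] || docker_dir="/"

# The 20 GiB threshold is an arbitrary illustrative value.
if ! has_free_gib "$docker_dir" 20; then
    echo "Less than 20 GiB free on the volume holding $docker_dir" >&2
fi
```

Running this before docker build gives an early warning instead of a failure partway through a multi-gigabyte download.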
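Note: I will try mounting the /var/lib/docker directory on the EBS volume when the notebook starts.
Note: I have no problem with the size of the attached EBS volume. My question is about the ephemeral volume.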
小智 7
I faced the same problem. Following the comments on this post, I changed the Docker root directory with the following commands:
sudo service docker stop
sudo mv /var/lib/docker /home/ec2-user/SageMaker/docker-data
sudo ln -s /home/ec2-user/SageMaker/docker-data /var/lib/docker
sudo service docker start
More information on changing the Docker root directory
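As far as I know, only /home/ec2-user/SageMaker persists when a notebook instance is stopped and started, so the symlink would have to be recreated after each restart. The following is a rough sketch of how the commands above could be made repeatable, for example from a lifecycle configuration on-start script. It assumes the script runs with root privileges; the function name relocate_dir and the --apply switch are hypothetical and only for illustration.

```shell
#!/bin/sh
# Move directory $1 to $2 (on the larger volume) and leave a symlink behind.
# Safe to call repeatedly.
relocate_dir() {
    src="$1"
    dest="$2"
    # Already a symlink: nothing to do.
    if [ -L "$src" ]; then
        return 0
    fi
    if [ -e "$dest" ]; then
        # Data from an earlier session already lives at $dest;
        # discard the freshly created copy on the root volume.
        rm -rf "$src"
    elif [ -e "$src" ]; then
        # First run: move the existing data to the larger volume.
        mkdir -p "$(dirname "$dest")"
        mv "$src" "$dest"
    else
        mkdir -p "$dest"
    fi
    ln -s "$dest" "$src"
}

# Only touch the real Docker directory when explicitly asked to.
if [ "${1:-}" = "--apply" ]; then
    set -e
    service docker stop
    relocate_dir /var/lib/docker /home/ec2-user/SageMaker/docker-data
    service docker start
fi
```

Calling the script with --apply performs the same stop, move, link, start sequence as the commands above. Without that argument it only defines the function, which makes it easy to try on scratch directories first.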