jra*_*raj 1 amazon-web-services pytorch amazon-sagemaker
我正在尝试将微调 BERT 模型的训练过程容器化,并在 SageMaker 上运行。我计划使用预构建的 SageMaker Pytorch GPU 容器 ( https://aws.amazon.com/releasenotes/available-deep-learning-containers-images/ ) 作为起点,但在提取图像时遇到问题我的构建过程。
我的 Dockerfile 如下所示:
# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04
ENV PATH="/opt/ml/code:${PATH}"
# /opt/ml and all subdirectories are utilized by SageMaker, we use the /code subdirectory to store our user code.
COPY /bert /opt/ml/code
# this environment variable is used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
# this environment variable is used by the SageMaker PyTorch container to determine our program entry point
# for training and serving.
# For more information: https://github.com/aws/sagemaker-pytorch-container
ENV SAGEMAKER_PROGRAM bert/train
Run Code Online (Sandbox Code Playgroud)
我的 build_and_push 脚本:
#!/usr/bin/env bash
# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.
# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
IMAGE="my-bert"
# parameters
PY_VERSION="py36"
# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)
if [ $? -ne 0 ]
then
exit 255
fi
chmod +x bert/train
# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-east-2}
# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names ${IMAGE} || aws ecr create-repository --repository-name ${IMAGE}
echo "---> repository done.."
# Get the login command from ECR and execute it directly
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $account.dkr.ecr.$region.amazonaws.com
echo "---> logged in to account ecr.."
# Get the login command from ECR in order to pull down the SageMaker PyTorch image
# aws ecr get-login-password --region $region | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
# echo "---> logged in to pytorch ecr.."
echo "Building image with arch=gpu, region=${region}"
TAG="gpu-${PY_VERSION}"
FULLNAME="${account}.dkr.ecr.${region}.amazonaws.com/${IMAGE}:${TAG}"
docker build -t ${IMAGE}:${TAG} --build-arg ARCH="$arch" -f "Dockerfile" .
docker tag ${IMAGE}:${TAG} ${FULLNAME}
docker push ${FULLNAME}
Run Code Online (Sandbox Code Playgroud)
我在推送过程中收到以下消息,并且未提取 sagemaker pytorch 映像:
Get https://763104351884.dkr.ecr.us-east-1.amazonaws.com/v2/pytorch-training/manifests/1.5.0-gpu-py36-cu101-ubuntu16.04: no basic auth credentials
Run Code Online (Sandbox Code Playgroud)
请告诉我这是否是使用预构建 SageMaker 映像的正确方法以及我可以采取哪些措施来修复此错误。
小智 5
在运行 docker build 之前,您应该运行如下命令:
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.${region}.amazonaws.com
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
1993 次 |
| 最近记录: |