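I'm new to Kubernetes. I'm trying to set up a Kubernetes cluster on AWS using kops, and the cluster itself came up successfully. However, I can't access the Dashboard UI ( https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/#accessing-the-dashboard-ui ).
When I visit the master node, I see the following error: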
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "no endpoints available for service \"kubernetes-dashboard\"",
"reason": "ServiceUnavailable",
"code": 503
}
I see that the dashboard pod's status is CrashLoopBackOff. (Note: I have removed the names of the other pods from the output below.)
~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system kubernetes-dashboard-4167803980-vnx3k 0/1 CrashLoopBackOff 6 6m
$ kubectl logs kubernetes-dashboard-4167803980-vnx3k --namespace=kube-system
2017/09/25 17:50:37 Using in-cluster config to connect to apiserver
2017/09/25 17:50:37 Using service account token for csrf signing
2017/09/25 17:50:37 …

I'm trying to containerize the training process for fine-tuning a BERT model and run it on SageMaker. I plan to use the prebuilt SageMaker PyTorch GPU container ( https://aws.amazon.com/releasenotes/available-deep-learning-containers-images/ ) as a starting point, but I'm having trouble pulling the image during my build process.
My Dockerfile looks like this:
# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04
ENV PATH="/opt/ml/code:${PATH}"
# /opt/ml and all subdirectories are utilized by SageMaker, we use the /code subdirectory to store our user code.
COPY /bert /opt/ml/code
# this environment variable is used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
# this environment variable is used by the SageMaker PyTorch container to determine our …
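
The excerpt above does not show the actual pull error, so the following is only a sketch under one assumption: that the pull fails because Docker is not authenticated to the ECR registry named in the FROM line. Private ECR registries generally require a login before `docker build` can pull from them. The helper names `ecr_registry` and `ecr_login` and the image tag `bert-training` are made up for illustration; the account ID and region are taken from the Dockerfile.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the pull failure is an ECR authentication problem.

# Build the ECR registry hostname from an account ID and a region.
ecr_registry() {
  local account_id="$1" region="$2"
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$account_id" "$region"
}

# Log Docker in to that registry. Needs an AWS CLI recent enough to provide
# `aws ecr get-login-password`, plus credentials allowed to pull from ECR.
ecr_login() {
  local account_id="$1" region="$2"
  aws ecr get-login-password --region "$region" |
    docker login --username AWS --password-stdin \
      "$(ecr_registry "$account_id" "$region")"
}

# Usage (nothing runs until you call it yourself):
#   ecr_login 763104351884 us-east-1
#   docker build -t bert-training .   # "bert-training" is a hypothetical tag
```

If the login succeeds and the pull still fails, the cause is something else (for example, a mismatched region or tag), and the error text from `docker build` would be needed to narrow it down.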