Oli*_*Oli 8 kubernetes google-kubernetes-engine
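We followed this guide to use GPU-enabled nodes in our existing cluster, but when we try to schedule pods we get the error 2 Insufficient nvidia.com/gpu
Details:
We are trying to use GPUs in our existing cluster, and for that we were able to successfully create a NodePool with a single GPU-enabled node.
Then, as the next step in the guide above, we created a DaemonSet, and we were able to run the DS successfully as well.
But now, when we try to schedule a Pod using the following resources section, the Pod becomes unschedulable with this error: 2 insufficient nvidia.com/gpu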
resources:
  limits:
    nvidia.com/gpu: "1"
  requests:
    cpu: 200m
    memory: 3Gi
Specs:
Node version - v1.18.17-gke.700 (+ v1.17.17-gke.6000) tried on both
Instance type - n1-standard-4
image - cos
GPU - NVIDIA Tesla T4
Any help or pointers for further debugging would be highly appreciated.
TIA,
Output of kubectl get node <gpu-node> -o yaml [edited]:
apiVersion: v1
kind: Node
metadata:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: n1-standard-4
    beta.kubernetes.io/os: linux
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-boot-disk: pd-standard
    cloud.google.com/gke-container-runtime: docker
    cloud.google.com/gke-nodepool: gpu-node
    cloud.google.com/gke-os-distribution: cos
    cloud.google.com/machine-family: n1
    failure-domain.beta.kubernetes.io/region: us-central1
    failure-domain.beta.kubernetes.io/zone: us-central1-b
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: n1-standard-4
    topology.kubernetes.io/region: us-central1
    topology.kubernetes.io/zone: us-central1-b
  name: gke-gpu-node-d6ddf1f6-0d7j
spec:
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: present
status:
  ...
  allocatable:
    attachable-volumes-gce-pd: "127"
    cpu: 3920m
    ephemeral-storage: "133948343114"
    hugepages-2Mi: "0"
    memory: 12670032Ki
    pods: "110"
  capacity:
    attachable-volumes-gce-pd: "127"
    cpu: "4"
    ephemeral-storage: 253696108Ki
    hugepages-2Mi: "0"
    memory: 15369296Ki
    pods: "110"
  conditions:
    ...
  nodeInfo:
    architecture: amd64
    containerRuntimeVersion: docker://19.3.14
    kernelVersion: 5.4.89+
    kubeProxyVersion: v1.18.17-gke.700
    kubeletVersion: v1.18.17-gke.700
    operatingSystem: linux
    osImage: Container-Optimized OS from Google
Tolerations on the deployment:
tolerations:
- effect: NoSchedule
  key: nvidia.com/gpu
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
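Also, nvidia-gpu-device-plugin should be installed on the GPU node. You should see the nvidia-gpu-device-plugin DaemonSet in your kube-system namespace.
It should be deployed automatically by Google, but if you want to deploy it yourself, run the following command: kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
It will install the GPU plugin on the node, after which your pods will be able to use it.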
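One way to confirm whether the plugin has taken effect is to check that the node advertises nvidia.com/gpu under status.allocatable. In the node output posted in the question that key is absent from allocatable, which is consistent with the Insufficient nvidia.com/gpu message. The following is a minimal sketch, not an official tool: the helper name gpu_allocatable is hypothetical, and it assumes the two-space indentation that kubectl normally emits with -o yaml.

```shell
#!/bin/sh
# gpu_allocatable (hypothetical helper): reads node YAML on stdin and prints
# the nvidia.com/gpu value listed under status.allocatable, or 0 if absent.
# Assumption: input is indented the way `kubectl get node -o yaml` prints it.
gpu_allocatable() {
  awk '
    /^  allocatable:/      { in_alloc = 1; next }   # entered status.allocatable
    in_alloc && /^  [^ ]/  { in_alloc = 0 }         # a sibling key ends the block
    in_alloc && $1 == "nvidia.com/gpu:" {
      gsub(/"/, "", $2)                             # strip the YAML quotes
      print $2
      found = 1
    }
    END { if (!found) print 0 }
  '
}

# Typical use against a live cluster (NODE is a placeholder you set yourself):
#   kubectl get node "$NODE" -o yaml | gpu_allocatable
#   kubectl get daemonset nvidia-gpu-device-plugin -n kube-system
```

If the function prints 0, the scheduler has no GPU capacity to hand out on that node, so the DaemonSet pods on the GPU node are the next thing to inspect.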