Unable to run pods with GPU in GKE: 2 Insufficient nvidia.com/gpu error

Oli*_*Oli 8 kubernetes google-kubernetes-engine

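We followed this guide to use GPU-enabled nodes in an existing cluster, but when we try to schedule pods we get a `2 Insufficient nvidia.com/gpu` error.

Details:

We are trying to use GPUs in an existing cluster. To that end, we successfully created a NodePool with a single GPU-enabled node.

Then, as the next step in the guide above, we created a DaemonSet, and it also runs successfully.

But now, when we try to schedule a pod with the following resources section, the pod becomes unschedulable with this error: `2 Insufficient nvidia.com/gpu`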

    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: 200m
        memory: 3Gi

Specs:

    Node version - v1.18.17-gke.700 (+ v1.17.17-gke.6000) tried on both
    Instance type - n1-standard-4
    image - cos
    GPU - NVIDIA Tesla T4

Any help or pointers for further debugging would be greatly appreciated.

TIA,


Output of `kubectl get node <gpu-node> -o yaml` [edited]

    apiVersion: v1
    kind: Node
    metadata:
      labels:
        beta.kubernetes.io/arch: amd64
        beta.kubernetes.io/instance-type: n1-standard-4
        beta.kubernetes.io/os: linux
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
        cloud.google.com/gke-boot-disk: pd-standard
        cloud.google.com/gke-container-runtime: docker
        cloud.google.com/gke-nodepool: gpu-node
        cloud.google.com/gke-os-distribution: cos
        cloud.google.com/machine-family: n1
        failure-domain.beta.kubernetes.io/region: us-central1
        failure-domain.beta.kubernetes.io/zone: us-central1-b
        kubernetes.io/arch: amd64
        kubernetes.io/os: linux
        node.kubernetes.io/instance-type: n1-standard-4
        topology.kubernetes.io/region: us-central1
        topology.kubernetes.io/zone: us-central1-b
      name: gke-gpu-node-d6ddf1f6-0d7j
    spec:
      taints:
      - effect: NoSchedule
        key: nvidia.com/gpu
        value: present
    status:
      ...
      allocatable:
        attachable-volumes-gce-pd: "127"
        cpu: 3920m
        ephemeral-storage: "133948343114"
        hugepages-2Mi: "0"
        memory: 12670032Ki
        pods: "110"
      capacity:
        attachable-volumes-gce-pd: "127"
        cpu: "4"
        ephemeral-storage: 253696108Ki
        hugepages-2Mi: "0"
        memory: 15369296Ki
        pods: "110"
      conditions:
        ...
      nodeInfo:
        architecture: amd64
        containerRuntimeVersion: docker://19.3.14
        kernelVersion: 5.4.89+
        kubeProxyVersion: v1.18.17-gke.700
        kubeletVersion: v1.18.17-gke.700
        operatingSystem: linux
        osImage: Container-Optimized OS from Google

Tolerations from the deployment

    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300

hil*_*rat 7

The `nvidia-gpu-device-plugin` should be installed on the GPU node. You should see the `nvidia-gpu-device-plugin` DaemonSet in the `kube-system` namespace.

It should be deployed automatically by Google, but if you want to deploy it yourself, run the following command: `kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml`

This installs the GPU plugin on the node, after which your pods will be able to use the GPU.
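
One way to tell whether the plugin has registered the GPU is to look for `nvidia.com/gpu` under `status.allocatable` on the node. The node output in the question has no such entry, which is consistent with the scheduler reporting `Insufficient nvidia.com/gpu`. The following is a minimal Python sketch of that check, not something from the guide: the helper name `allocatable_gpus` is hypothetical, the first sample is trimmed from the question's node, and the second sample is an assumed illustration of a node that does advertise one GPU.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(node_json: str) -> int:
    """Return how many GPUs a node advertises as allocatable (0 if none).

    `node_json` is the JSON form of a Node object. Hypothetical helper,
    written only to illustrate the check.
    """
    node = json.loads(node_json)
    allocatable = node.get("status", {}).get("allocatable", {})
    # The resource key is absent until a device plugin registers the GPU.
    return int(allocatable.get(GPU_RESOURCE, "0"))


# Trimmed from the node in the question: no nvidia.com/gpu entry.
node_without_plugin = json.dumps(
    {"status": {"allocatable": {"cpu": "3920m", "memory": "12670032Ki", "pods": "110"}}}
)

# Assumed illustration (not from the question): a node advertising one GPU.
node_with_plugin = json.dumps(
    {"status": {"allocatable": {"cpu": "3920m", "nvidia.com/gpu": "1", "pods": "110"}}}
)

print(allocatable_gpus(node_without_plugin))  # 0
print(allocatable_gpus(node_with_plugin))     # 1
```

On a real cluster the input would come from `kubectl get node <gpu-node> -o json`. A result of 0 on the GPU node suggests the device plugin is not running there.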

  • I had to install the `nvidia-gpu-device-plugin` DS manually. Not sure why it was not available on our GKE nodes. (2 upvotes)