Unable to run pods with GPUs in GKE: 2 Insufficient nvidia.com/gpu error

Question

Oli*_*Oli 8 kubernetes google-kubernetes-engine

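We followed this guide to use GPU-enabled nodes in our existing cluster, but when we try to schedule pods we get a `2 Insufficient nvidia.com/gpu` error.

Details:

We are trying to use GPUs in an existing cluster. To that end, we successfully created a NodePool with a single GPU-enabled node.

Then, as the next step in the guide above, we created the DaemonSet, and it also runs successfully.

But now, when we try to schedule a pod with the following resources section, the pod becomes unschedulable with the error `2 Insufficient nvidia.com/gpu`: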

    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: 200m
        memory: 3Gi

Specs:

Node version - v1.18.17-gke.700 (+ v1.17.17-gke.6000) tried on both
Instance type - n1-standard-4
image - cos
GPU - NVIDIA Tesla T4

Any help or pointers for further debugging would be highly appreciated.

TIA,

Output of `kubectl get node <gpu-node> -o yaml` [redacted]:

apiVersion: v1
kind: Node
metadata:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: n1-standard-4
    beta.kubernetes.io/os: linux
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-boot-disk: pd-standard
    cloud.google.com/gke-container-runtime: docker
    cloud.google.com/gke-nodepool: gpu-node
    cloud.google.com/gke-os-distribution: cos
    cloud.google.com/machine-family: n1
    failure-domain.beta.kubernetes.io/region: us-central1
    failure-domain.beta.kubernetes.io/zone: us-central1-b
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: n1-standard-4
    topology.kubernetes.io/region: us-central1
    topology.kubernetes.io/zone: us-central1-b
  name: gke-gpu-node-d6ddf1f6-0d7j
spec:
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: present
status:
  ...
  allocatable:
    attachable-volumes-gce-pd: "127"
    cpu: 3920m
    ephemeral-storage: "133948343114"
    hugepages-2Mi: "0"
    memory: 12670032Ki
    pods: "110"
  capacity:
    attachable-volumes-gce-pd: "127"
    cpu: "4"
    ephemeral-storage: 253696108Ki
    hugepages-2Mi: "0"
    memory: 15369296Ki
    pods: "110"
  conditions:
    ...
  nodeInfo:
    architecture: amd64
    containerRuntimeVersion: docker://19.3.14
    kernelVersion: 5.4.89+
    kubeProxyVersion: v1.18.17-gke.700
    kubeletVersion: v1.18.17-gke.700
    operatingSystem: linux
    osImage: Container-Optimized OS from Google

Tolerations on the deployment:

  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

Answer 1

hil*_*rat 7

The `nvidia-gpu-device-plugin` should also be installed on the GPU node. You should see the `nvidia-gpu-device-plugin` DaemonSet in the `kube-system` namespace.
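
One way to check for that DaemonSet without reading the listing by eye is to filter the JSON from `kubectl get daemonset -n kube-system -o json`. The following is only a minimal Python sketch: the function name `gpu_plugin_daemonsets` and the sample data are made up for illustration, and the substring match is an assumption so that name variants are also caught.

```python
def gpu_plugin_daemonsets(daemonset_list: dict,
                          fragment: str = "nvidia-gpu-device-plugin") -> dict:
    """Map each DaemonSet whose name contains `fragment` to (desired, ready) pod counts."""
    found = {}
    for item in daemonset_list.get("items", []):
        name = item["metadata"]["name"]
        if fragment in name:
            status = item.get("status", {})
            found[name] = (
                status.get("desiredNumberScheduled", 0),
                status.get("numberReady", 0),
            )
    return found


# Hypothetical, heavily trimmed listing; real kubectl output has many more fields.
sample = {
    "items": [
        {"metadata": {"name": "kube-proxy"},
         "status": {"desiredNumberScheduled": 3, "numberReady": 3}},
        {"metadata": {"name": "nvidia-gpu-device-plugin"},
         "status": {"desiredNumberScheduled": 1, "numberReady": 1}},
    ]
}

print(gpu_plugin_daemonsets(sample))
# {'nvidia-gpu-device-plugin': (1, 1)}
print(gpu_plugin_daemonsets({"items": []}))
# {}
```

An empty result suggests the device plugin DaemonSet is missing from `kube-system`. A match with fewer ready pods than desired suggests it exists but is not running on the GPU node yet.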

The DaemonSet should be deployed automatically by Google, but if you want to deploy it yourself, run the following command:

    kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml

This installs the GPU device plugin on the node, after which your pods will be able to use the GPU.
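
To confirm the plugin has registered the GPU, look at the node's `status.allocatable`. In the node output from the question there is no `nvidia.com/gpu` entry under `allocatable` or `capacity`, which is consistent with the scheduler reporting the resource as insufficient. Below is a minimal Python sketch of that check. The helper name `allocatable_gpus` is made up, and the "after" node is a hypothetical example of what the status might look like once one GPU is advertised.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(node: dict) -> int:
    """Return how many GPUs the node advertises as allocatable (0 if none)."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(GPU_RESOURCE, "0"))


# Trimmed copy of the node from the question: no nvidia.com/gpu entry.
node_before = json.loads(
    '{"status": {"allocatable": {"cpu": "3920m", "memory": "12670032Ki", "pods": "110"}}}'
)
# Hypothetical status after the device plugin has registered one GPU.
node_after = json.loads(
    '{"status": {"allocatable": {"cpu": "3920m", "nvidia.com/gpu": "1"}}}'
)

print(allocatable_gpus(node_before))  # 0
print(allocatable_gpus(node_after))   # 1
```

For a real node, the same function can be fed the parsed output of `kubectl get node <gpu-node> -o json`, for example by reading it with `json.load(sys.stdin)`.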

I had to install the `nvidia-gpu-device-plugin` DS manually. Not sure why it was not available on our GKE nodes. (2 upvotes)
