D. *_*ard · google-cloud-platform · google-kubernetes-engine
I'm deploying an Autopilot cluster on GKE, but when I try to deploy a Pod I get the CPU/memory insufficiency errors shown below. `kubectl get nodes` returns 3 nodes, each with roughly 0.5 CPU and about the same amount of memory free, so they are very small. I'm trying to run a GPU-heavy job, so I expected GKE to scale up, but instead it keeps reporting insufficient resources. What am I doing wrong?
```
Warning FailedScheduling 27m (x5 over 31m) gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 Insufficient cpu, 2 Insufficient memory.
Warning FailedScheduling 26m gke.io/optimize-utilization-scheduler 0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 2 Insufficient cpu, 2 Insufficient memory.
Normal TriggeredScaleUp 26m cluster-autoscaler pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/picdmo-342711/zones/us-central1-c/instanceGroups/gk3-picdmo-nap-1wcisjk4-2ba03e97-grp 0->1 (max: 1000)}]
Normal NotTriggerScaleUp 25m (x6 over 30m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 Insufficient memory, 2 in backoff after failed scale-up
Normal NotTriggerScaleUp 20m (x14 over 21m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient cpu, 1 Insufficient memory
Normal TriggeredScaleUp 15m cluster-autoscaler pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/picdmo-342711/zones/us-central1-c/instanceGroups/gk3-picdmo-nap-xt7d8ijc-37c84d94-grp 0->1 (max: 1000)}]
Warning FailedScaleUp 15m (x5 over 31m) cluster-autoscaler Node scale up in zones us-central1-c associated with this pod failed: GCE quota exceeded. Pod is at risk of not being scheduled.
Warning FailedScheduling 15m (x6 over 20m) gke.io/optimize-utilization-scheduler 0/4 nodes are available: 3 Insufficient memory, 4 Insufficient cpu.
Normal NotTriggerScaleUp 14m (x2 over 15m) cluster-autoscaler (combined from similar events): pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 in backoff after failed scale-up, 2 Insufficient cpu, 1 Insufficient memory
Warning FailedScheduling 13m (x2 over 14m) gke.io/optimize-utilization-scheduler 0/4 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1660665555}, that the pod didn't tolerate, 3 Insufficient cpu, 3 Insufficient memory.
Normal NotTriggerScaleUp 4m50s (x135 over 29m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 Insufficient memory
Warning FailedScheduling 92s (x17 over 25m) gke.io/optimize-utilization-scheduler 0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory.
```
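To see how much of each node's capacity is actually left for Pods (rather than the raw machine size), the allocatable figures mentioned above can be read directly; this is a generic `kubectl` sketch, not specific to this cluster:

```shell
# Print allocatable CPU and memory per node; this is the budget the
# scheduler compares Pod requests against, after system reservations.
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory
```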
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: asd-job-
spec:
  template:
    spec:
      containers:
      - name: asd
        image: gcr.io/-342711/-job:latest
        imagePullPolicy: Always
        command: ["/bin/sh"]
        args: ["-c", "echo"]
        resources:
          requests:
            memory: "16000Mi"
            cpu: "8000m"
          limits:
            memory: "32000Mi"
            cpu: "16000m"
            nvidia.com/gpu: 2
      restartPolicy: Never
  backoffLimit: 4
```
Starting from the seventh event line:

```
us-central1-c associated with this pod failed: GCE quota exceeded
```

This is most likely a quota problem. Check IAM & Admin > Quotas.
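The same quotas can be inspected from the CLI; a sketch assuming the gcloud SDK is installed and authenticated, with `PROJECT_ID` as a placeholder for your project:

```shell
# List regional Compute Engine quotas (CPUS, NVIDIA_*_GPUS, etc.) with
# current usage vs. limit; --flatten expands the repeated quotas field.
gcloud compute regions describe us-central1 \
    --project PROJECT_ID \
    --flatten="quotas[]" \
    --format="table(quotas.metric, quotas.usage, quotas.limit)"
```

If the usage for a metric such as `CPUS` or a GPU quota is at its limit, the autoscaler cannot add nodes and you will see exactly the `GCE quota exceeded` event above; request a quota increase for that metric in the affected region.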