Pod with a default-class PV takes 30 minutes to upgrade, waiting for the disk to attach

JAl*_*rto 5 google-kubernetes-engine

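I deployed a Helm chart (StatefulSet) with 1 pod and 2 containers, one of which has a PV (ReadWriteOnce) attached. On upgrade, the pod takes 30 minutes (7 failed attempts) to come back up, so the service is down for 30 minutes.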

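Some background:

  • The PV uses the default GKE storage class
  • It is a regional GKE cluster, with only one node in each zone
  • The pod comes back up on the same node, even though this is not enforced (so I see no node transfer)
  • I have a similar issue in Azure AKS, where it also fails 7 times, but faster, so the downtime is minimal, and there a node transfer is involved

Relevant parts of the yaml file: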

volumeMounts:
  - mountPath: /app/data
    name: prod-data
  volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: prod-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 500Gi
      storageClassName: standard
      volumeMode: Filesystem

Error message:

Unable to mount volumes for pod "foo" timeout expired waiting for volumes to attach or mount for pod "foo". list of unmounted volumes=[foo] list of unattached volumes [foo default-token-foo]

Extra context, this is what happens after triggering the StatefulSet upgrade:

Nothing has changed:

Name:          prod-data-prod-0
Namespace:     prod
StorageClass:  standard
Status:        Bound
Volume:        pvc-16f49d12-f644-11e9-952a-4201ac100008
Labels:        app=prod
               release=prod
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      500Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    prod-0
Events:        <none>

Then the first error:

Unable to mount volumes for pod "prod-0_prod(89fb0cf5-0008-11ea-b349-4201ac100009)": timeout expired waiting for volumes to attach or mount for pod "prod"/"prod-0". list of unmounted volumes=[prod-data]. list of unattached volumes=[prod-data default-token-4624v]

Still the same description:

Name:          prod-data-prod-0
Namespace:     prod
StorageClass:  standard
Status:        Bound
Volume:        pvc-16f49d12-f644-11e9-952a-4201ac100008
Labels:        app=prod
               release=prod
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      500Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    prod-0
Events:        <none>

After the second failed mount, this is the pod description:

Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  vlapi-prod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prod-data-prod-0
    ReadOnly:   false
  default-token-4624v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4624v
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

FailedMount no. 3 changes the events in the pod description:

Events:
  Type     Reason       Age                   From                                             Message
  ----     ------       ----                  ----                                             -------
  Normal   Scheduled    8m44s                 default-scheduler                                Successfully assigned prod/prod-0 to gke-vlgke-a-default-pool-312c60b0-p8lb
  Warning  FailedMount  2m8s (x3 over 6m41s)  kubelet, gke-vlgke-a-default-pool-312c60b0-p8lb  Unable to mount volumes for pod "prod-0_prod(89fb0cf5-0008-11ea-b349-4201ac100009)": timeout expired waiting for volumes to attach or mount for pod "prod"/"prod-0". list of unmounted volumes=[prod-data]. list of unattached volumes=[prod-data default-token-4624v]

Warning FailedMount 48s (x4 over 7m38s)
Warning FailedMount 13s (x5 over 9m17s)

Name:              pvc-16f49d12-f644-11e9-952a-4201ac100008
Labels:            failure-domain.beta.kubernetes.io/region=europe-west1
                   failure-domain.beta.kubernetes.io/zone=europe-west1-d
Annotations:       kubernetes.io/createdby: gce-pd-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/gce-pd
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      standard
Status:            Bound
Claim:             prod/prod-data-prod-0
Reclaim Policy:    Retain
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          500Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [europe-west1-d]
                   failure-domain.beta.kubernetes.io/region in [europe-west1]
Message:           
Source:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     gke-vlgke-a-0d42343f-d-pvc-16f49d12-f644-11e9-952a-4201ac100008
    FSType:     ext4
    Partition:  0
    ReadOnly:   false

FailedMount 47s (x6 over 12m)
FailedMount 11s (x7 over 13m)
FailedMount 33s (x8 over 16m)
FailedMount 9s (x9 over 18m)
FailedMount 0s (x10 over 20m)

There are about 2 minutes between FailedMount timeouts.

Events:
  Type     Reason       Age                  From                                             Message
  ----     ------       ----                 ----                                             -------
  Normal   Scheduled    24m                  default-scheduler                                Successfully assigned prod/prod-0 to gke-vlgke-a-default-pool-312c60b0-p8lb
  Warning  FailedMount  2m4s (x10 over 22m)  kubelet, gke-vlgke-a-default-pool-312c60b0-p8lb  Unable to mount volumes for pod "prod-0_prod(89fb0cf5-0008-11ea-b349-4201ac100009)": timeout expired waiting for volumes to attach or mount for pod "prod"/"prod-0". list of unmounted volumes=[prod-data]. list of unattached volumes=[prod-data default-token-4624v]
  Normal   Pulling      11s                  kubelet, gke-gke-default-pool-312c60b0-p8lb  Pulling image "gcr.io/foo-251818/foo:2019-11-05"

The 11th mount attempt worked, without any change that I could find in the PVC description.

Cos*_*ntu 1

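One possibility is that your pod's spec.securityContext.runAsUser and spec.securityContext.fsGroup are different from 0 (non-root), in which case k8s tries to change the access permissions of every file on the volume, which takes a while. Try setting them to 0 in the pod definition: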

spec:
  securityContext:
    runAsUser: 0
    fsGroup: 0
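
If running the pod as root is not acceptable, one possible alternative is to keep the non-root IDs and ask the kubelet to skip the recursive permission change when the volume root already has the expected ownership. This is only a sketch under assumptions: the `fsGroupChangePolicy` field is not part of the original post and exists only in newer Kubernetes releases, so it may not be available on the cluster in question, and the ID `1000` is a hypothetical placeholder.

```yaml
spec:
  securityContext:
    runAsUser: 1000                        # hypothetical non-root UID (assumption)
    fsGroup: 1000                          # hypothetical GID (assumption)
    # Skip the recursive ownership change if the volume root already matches fsGroup.
    fsGroupChangePolicy: "OnRootMismatch"
```

This can only help if the slow step really is the ownership change on the 500Gi volume; it does not shorten the time the disk takes to attach to the node.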

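Other possibilities include a mismatch in attributes (access mode, capacity) between the PVC and the PV. Also, if you have defined only a single such PV, starting multiple pods that use an RWO PVC may cause contention.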