Permission errors when trying to run Prometheus on AWS EKS (Fargate only) with EFS

jmk*_*ite 1 amazon-web-services kubernetes prometheus amazon-efs amazon-eks

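I have an EKS cluster that is Fargate only; I really don't want to manage instances myself. I'd like to deploy Prometheus to it, which requires a persistent volume. As of two months ago this should be possible with EFS (a managed NFS share). I feel I am almost there, but I cannot figure out what the current problem is.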

What I have done:

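  • Set up an EKS Fargate cluster and a suitable Fargate profile
  • Set up an EFS file system with an appropriate security group
  • Installed the CSI driver and validated the EFS setup as per the AWS walkthrough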
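
For reference, the `efs-sc` storage class named in the manifests below is presumably the one from the AWS walkthrough. A minimal sketch, assuming the walkthrough's defaults (static provisioning, so no parameters):

```yaml
# Assumed storage class from the EFS CSI driver walkthrough; the name has to
# match storageClassName in the PVs and in the Helm values.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
```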

So far so good.

I set up the persistent volumes (I understand these have to be provisioned statically):

kubectl apply -f pvc/

where

tree pvc/
pvc/
├── two_pvc.yml
└── ten_pvc.yml

cat pvc/*

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-two
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-ten
spec:
  capacity:
    storage: 8Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234

Then

helm upgrade --install myrelease-helm-02 prometheus-community/prometheus \
    --namespace prometheus \
    --set alertmanager.persistentVolume.storageClass="efs-sc",server.persistentVolume.storageClass="efs-sc"

What happens?

The PVC for the Prometheus alertmanager works fine, as do the other pods in this deployment, but the Prometheus server goes into CrashLoopBackOff with

invalid capacity 0 on filesystem

Diagnostics

kubectl get pv -A
NAME                          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                               STORAGECLASS   REASON   AGE
efs-pv-ten                    8Gi        RWO            Retain           Bound      prometheus/myrelease-helm-02-prometheus-server         efs-sc                  11m
efs-pv-two                    2Gi        RWO            Retain           Bound      prometheus/myrelease-helm-02-prometheus-alertmanager   efs-sc                  11m

kubectl get pvc -A
NAMESPACE    NAME                                     STATUS   VOLUME       CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus   myrelease-helm-02-prometheus-alertmanager   Bound    efs-pv-two   2Gi        RWO            efs-sc         12m
prometheus   myrelease-helm-02-prometheus-server         Bound    efs-pv-ten   8Gi        RWO            efs-sc         12m

describe pod only shows "Error"

Finally, this (from a colleague):

level=info ts=2020-10-09T15:17:08.898Z caller=main.go:346 msg="Starting Prometheus" version="(version=2.21.0, branch=HEAD, revision=e83ef207b6c2398919b69cd87d2693cfc2fb4127)"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:347 build_context="(go=go1.15.2, user=root@a4d9bea8479e, date=20200911-11:35:02)"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:348 host_details="(Linux 4.14.193-149.317.amzn2.x86_64 #1 SMP Thu Sep 3 19:04:44 UTC 2020 x86_64 myrelease-helm-02-prometheus-server-85765f9895-vxrkn (none))"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:349 fd_limits="(soft=1024, hard=4096)"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:350 vm_limits="(soft=unlimited, hard=unlimited)"
level=error ts=2020-10-09T15:17:08.901Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/data/queries.active err="open /data/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker(0x7fffeb6e85ee, 0x5, 0x14, 0x30ca080, 0xc000d43620, 0x30ca080)
    /app/promql/query_logger.go:117 +0x4cf
main.main()
    /app/cmd/prometheus/main.go:377 +0x510c

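Other than there being a permissions issue, I am baffled: I know the storage "works" and is accessible, since another pod in the deployment seems happy with it, but this one is not.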

jmk*_*ite 5

Working now, and writing it up here for the common good. Thanks to /u/EmiiKhaos on reddit for the suggestions on where to look.

Problem:

The EFS share is root:root only, and Prometheus forbids running its pods as root.
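
One way to confirm the ownership mismatch would be a throwaway pod that mounts the same claim as a non-root user and prints what it sees. This is only a sketch: the pod name `efs-probe` is hypothetical, uid 65534 is an assumption about the user the chart runs Prometheus as, and the pod has to match a Fargate profile to be scheduled.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: efs-probe            # hypothetical name
  namespace: prometheus
spec:
  restartPolicy: Never
  securityContext:
    runAsUser: 65534         # assumption: the non-root uid the server runs as
    runAsGroup: 65534
  containers:
    - name: probe
      image: busybox
      # Print the effective ids, the numeric owner of the mount, then try a write.
      command: ["sh", "-c", "id; ls -ldn /data; touch /data/probe && echo writable"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: myrelease-helm-02-prometheus-server
```

`kubectl logs efs-probe -n prometheus` would then show whether the directory is owned by 0:0 and whether the write is refused.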

Solution:

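  • Create an EFS access point for each pod that needs a persistent volume, to permit access for a specified user.
  • Specify these access points in the persistent volumes.
  • Apply a suitable security context so that the pods run as the matching user.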

Method:

Create 2 EFS access points, something like:

{
    "Name": "prometheuserver",
    "AccessPointId": "fsap-<hex01>",
    "FileSystemId": "fs-ec0e1234",
    "PosixUser": {
        "Uid": 500,
        "Gid": 500,
        "SecondaryGids": [
            2000
        ]
    },
    "RootDirectory": {
        "Path": "/prometheuserver",
        "CreationInfo": {
            "OwnerUid": 500,
            "OwnerGid": 500,
            "Permissions": "0755"
        }
    }
},
{
    "Name": "prometheusalertmanager",
    "AccessPointId": "fsap-<hex02>",
    "FileSystemId": "fs-ec0e1234",
    "PosixUser": {
        "Uid": 501,
        "Gid": 501,
        "SecondaryGids": [
            2000
        ]
    },
    "RootDirectory": {
        "Path": "/prometheusalertmanager",
        "CreationInfo": {
            "OwnerUid": 501,
            "OwnerGid": 501,
            "Permissions": "0755"
        }
    }
}
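
The JSON above describes the access points once they exist. If the infrastructure happens to be managed with CloudFormation, the first one could be declared roughly as follows. This is a sketch, not the method used here: the logical ID `PrometheusServerAccessPoint` is hypothetical, and note that CloudFormation takes the uid/gid values as strings.

```yaml
Resources:
  PrometheusServerAccessPoint:          # hypothetical logical ID
    Type: AWS::EFS::AccessPoint
    Properties:
      FileSystemId: fs-ec0e1234
      PosixUser:
        Uid: "500"
        Gid: "500"
        SecondaryGids:
          - "2000"
      RootDirectory:
        Path: /prometheuserver
        CreationInfo:
          OwnerUid: "500"
          OwnerGid: "500"
          Permissions: "0755"
      AccessPointTags:
        - Key: Name
          Value: prometheuserver
```

The alertmanager access point would follow the same shape with uid/gid 501 and its own path.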

Update my persistent volumes:

kubectl apply -f pvc/

to something like:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheusalertmanager
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234::fsap-<hex02>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheusserver
spec:
  capacity:
    storage: 8Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234::fsap-<hex01>

Reinstall Prometheus as before:

helm upgrade --install myrelease-helm-02 prometheus-community/prometheus \
    --namespace prometheus \
    --set alertmanager.persistentVolume.storageClass="efs-sc",server.persistentVolume.storageClass="efs-sc"

Take an educated guess from

kubectl describe pod myrelease-helm-02-prometheus-server -n prometheus

kubectl describe pod myrelease-helm-02-prometheus-alertmanager -n prometheus

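at which container needs to be specified when setting the security context. Then apply a security context so that the pods run with the appropriate uid:gid, e.g. with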

kubectl apply -f setpermissions/

where

cat setpermissions/*

apiVersion: v1
kind: Pod
metadata:
  name: myrelease-helm-02-prometheus-alertmanager
spec:
  securityContext:
    runAsUser: 501
    runAsGroup: 501
    fsGroup: 501
  volumes:
    - name: prometheusalertmanager
  containers:
    - name: prometheusalertmanager
      image: jimmidyson/configmap-reload:v0.4.0
      securityContext:
        runAsUser: 501
        allowPrivilegeEscalation: false
---
apiVersion: v1
kind: Pod
metadata:
  name: myrelease-helm-02-prometheus-server
spec:
  securityContext:
    runAsUser: 500
    runAsGroup: 500
    fsGroup: 500
  volumes:
    - name: prometheusserver
  containers:
    - name: prometheusserver
      image: jimmidyson/configmap-reload:v0.4.0
      securityContext:
        runAsUser: 500
        allowPrivilegeEscalation: false
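
As an alternative to standalone Pod manifests, the same uid:gid could probably be set through the chart's values, so that the pods created by the release carry it. This is a sketch under the assumption that the installed chart version exposes `server.securityContext` and `alertmanager.securityContext` (worth checking with `helm show values prometheus-community/prometheus`); the file name `values-efs.yml` is hypothetical.

```yaml
# values-efs.yml (hypothetical file name)
server:
  persistentVolume:
    storageClass: efs-sc
  securityContext:
    runAsUser: 500      # matches the prometheuserver access point
    runAsGroup: 500
    fsGroup: 500
alertmanager:
  persistentVolume:
    storageClass: efs-sc
  securityContext:
    runAsUser: 501      # matches the prometheusalertmanager access point
    runAsGroup: 501
    fsGroup: 501
```

It would be passed to the same `helm upgrade --install` command with `-f values-efs.yml` in place of the `--set` flags.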