New GKE cluster logs thousands of errors

Question

After creating a brand-new Kubernetes cluster in Google Kubernetes Engine, I see a large number of errors related to the metrics agent in Google Cloud Logging.

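I first ran into this on an existing cluster running version 1.18.x. I upgraded to 1.19.x after it was suggested that this would fix it. The problem persisted, so I upgraded to 1.20.x, and still nothing changed.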

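Eventually I created a new cluster with the latest Kubernetes version, using the command below, and still saw hundreds of errors logged immediately afterwards: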

gcloud beta container clusters create "my-cluster-1" \
    --project "my-project-1" \
    --zone "europe-west2-a" \
    --no-enable-basic-auth \
    --release-channel "rapid" \
    --cluster-version "1.20.2-gke.2500" \
    --machine-type "e2-standard-2" \
    --image-type "COS_CONTAINERD" \
    --disk-type "pd-standard" \
    --disk-size "100" \
    --metadata disable-legacy-endpoints=true \
    --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
    --num-nodes "1" \
    --enable-stackdriver-kubernetes \
    --enable-private-nodes \
    --master-ipv4-cidr "172.16.0.0/28" \
    --enable-ip-alias \
    --network "projects/my-project-1/global/networks/default" \
    --subnetwork "projects/my-project-1/regions/europe-west2/subnetworks/default" \
    --default-max-pods-per-node "110" \
    --no-enable-master-authorized-networks \
    --addons HorizontalPodAutoscaling,HttpLoadBalancing,NodeLocalDNS,GcePersistentDiskCsiDriver \
    --enable-autoupgrade \
    --enable-autorepair \
    --max-surge-upgrade 1 \
    --max-unavailable-upgrade 0 \
    --workload-pool "my-project-1.svc.id.goog" \
    --enable-shielded-nodes \
    --node-locations "europe-west2-a","europe-west2-b","europe-west2-c"

In Google Cloud Logging, I check for the errors with the following query:

severity=ERROR
AND (resource.labels.container_name:"gke-metrics-agent"
OR resource.labels.container_name="metrics-server-nanny")
resource.labels.cluster_name="my-cluster-1"
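For anyone who prefers to run the same check from a terminal, here is a minimal sketch built on `gcloud logging read`. The helper names `metrics_agent_error_filter` and `count_metrics_agent_errors` are hypothetical, the default one-hour lookback is an assumption, and the filter is simply the query above with the last clause joined by an explicit AND.

```shell
# Hypothetical helper: builds the same Cloud Logging filter as the query above
# for a given cluster name (first argument).
metrics_agent_error_filter() {
  printf 'severity=ERROR AND (resource.labels.container_name:"gke-metrics-agent" OR resource.labels.container_name="metrics-server-nanny") AND resource.labels.cluster_name="%s"\n' "$1"
}

# Hypothetical helper: counts matching log entries.
#   $1 = cluster name, $2 = project ID, $3 = lookback window (assumed default: 1h)
count_metrics_agent_errors() {
  gcloud logging read "$(metrics_agent_error_filter "$1")" \
    --project "$2" \
    --freshness "${3:-1h}" \
    --format 'value(insertId)' | wc -l
}

# Example call (needs an authenticated gcloud):
#   count_metrics_agent_errors "my-cluster-1" "my-project-1" 30m
```

Running the count a few times some minutes apart should show whether the errors taper off after cluster creation or keep accumulating.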

Following another suggestion, I waited more than 10 minutes, but the same volume of errors was still being logged.

Update, 5 March 2021

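I created a new test cluster through the UI. As suggested, I changed nothing except setting the cluster name to test-cluster-1, the zone to europe-west2-a, and the Kubernetes version to the latest one in the rapid channel.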

Immediately after the new cluster was created, hundreds of errors were logged.

I will keep watching for 15-20 minutes to see whether this continues.

Answer 1

Pjo*_*erS 2

As mentioned in a previous thread, GKE cluster version 1.18.12-gke.1206 contained a bug that logged hundreds of Prometheus errors:

github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport

This issue was reported through the Issue Tracker and has been fixed in versions 1.18.14-gke.1200+ and 1.19.6-gke.600+. New clusters created on those versions or later include the fix.

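The OP's cluster configuration contains a flag that makes the issue reappear. I tested a few scenarios, but it was the OP, @dustinmoris, who found that it is caused by the NodeLocalDNS addon.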

Enabling that single addon, NodeLocalDNS, is enough to make the issue reappear. This was tested on the following versions: 1.20.2-gke.2500, 1.19.7-gke.1500, 1.19.7-gke.2503, and 1.18.15-gke.1102.
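To see whether an existing cluster has this addon switched on, something like the following sketch may help. It assumes the addon state is exposed under `addonsConfig.dnsCacheConfig.enabled` in the output of `gcloud container clusters describe`; the helper name `node_local_dns_enabled` is hypothetical, and the cluster, zone, and project values are the ones from the question.

```shell
# Hypothetical helper: prints the NodeLocalDNS addon state for a zonal cluster.
#   $1 = cluster name, $2 = zone, $3 = project ID
# Assumption: the state is reported in addonsConfig.dnsCacheConfig.enabled;
# an empty result most likely means the addon was never configured.
node_local_dns_enabled() {
  gcloud container clusters describe "$1" \
    --zone "$2" \
    --project "$3" \
    --format 'value(addonsConfig.dnsCacheConfig.enabled)'
}

# Example call (needs an authenticated gcloud):
#   node_local_dns_enabled "my-cluster-1" "europe-west2-a" "my-project-1"
```

If the addon turns out to be enabled on an affected cluster, that matches the reproduction described above. Whether disabling it is acceptable depends on the workload, so treat this only as a diagnostic step.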

A comment describing this has been added to the Issue Tracker. For any updates, please follow that issue.
