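On GKE, we are seeing random API errors. Very often we get "error dialing backend: EOF".
We use Jenkins on top of K8s to manage our builds. Recently, jobs have been terminated with this error: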
Executing shell script inside container [protobuf] of pod [kubernetes-bad0aa993add416e80bdc1e66d1b30fc-536045ac8bbe]
java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
at com.squareup.okhttp.ws.WebSocketCall.createWebSocket(WebSocketCall.java:123)
at com.squareup.okhttp.ws.WebSocketCall.access$000(WebSocketCall.java:40)
at com.squareup.okhttp.ws.WebSocketCall$1.onResponse(WebSocketCall.java:98)
at com.squareup.okhttp.Call$AsyncCall.execute(Call.java:177)
at com.squareup.okhttp.internal.NamedRunnable.run(NamedRunnable.java:33)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This case looks a lot like: https://gitlab.com/gitlab-org/gitlab-runner/issues/3247
Many audit log entries point at exec URLs such as:
permission: "io.k8s.core.v1.pods.exec.create"
resource: "core/v1/namespaces/default/pods/pubsub-6132c0bc-2542-46a2-8041-c865f238698d-4ccc0-c1nkz-lqg5x/exec/pubsub-6132c0bc-2542-46a2-8041-c865f238698d-4ccc0-c1nkz-lqg5x"
and
permission: "io.k8s.core.v1.pods.exec.get"
resource: "core/v1/namespaces/default/pods/pubsub-a5a21f14-0bd1-4338-87b1-8658c3bbc7ad-9gm4n-8nz14/exec"
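But I don't understand why this error shows up on Kubernetes...
Update:
These errors can be verified with kube-state-metrics, using two metrics: ssh_tunnel_open_count and ssh_tunnel_open_fail_count.
For me, the number of failed SSH tunnel opens keeps growing, with more than 200 SSH tunnels open. …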
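To make the comparison between the two counters concrete, here is a minimal Python sketch that parses Prometheus text-format output and computes the share of failed tunnel opens. Only the two metric names come from the post; the helper names (parse_counters, tunnel_failure_ratio) and the sample values are made up for illustration, and how you obtain the metrics text from your cluster is left out.

```python
import re

# Matches "name value" or "name{labels} value" lines in Prometheus text format.
_SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_counters(metrics_text, names):
    """Sum the values of the given metric names across all label sets."""
    totals = {name: 0.0 for name in names}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if match and match.group(1) in totals:
            totals[match.group(1)] += float(match.group(3))
    return totals


def tunnel_failure_ratio(metrics_text):
    """Return failed opens divided by total opens (0.0 if nothing was opened)."""
    totals = parse_counters(
        metrics_text,
        ["ssh_tunnel_open_count", "ssh_tunnel_open_fail_count"],
    )
    opened = totals["ssh_tunnel_open_count"]
    failed = totals["ssh_tunnel_open_fail_count"]
    return failed / opened if opened else 0.0


# Hypothetical sample values, not taken from a real cluster.
SAMPLE = """\
# TYPE ssh_tunnel_open_count counter
ssh_tunnel_open_count 200
# TYPE ssh_tunnel_open_fail_count counter
ssh_tunnel_open_fail_count 50
"""

if __name__ == "__main__":
    print(f"failure ratio: {tunnel_failure_ratio(SAMPLE):.2%}")
```

Sampling this ratio at intervals would show whether failures grow faster than successful opens, which is the pattern described above.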