Docker Swarm does not restart an unhealthy Selenium hub container

use*_*987 7 selenium docker docker-compose docker-swarm

I have deployed a Selenium Grid using Docker Swarm.

docker-compose.yml

version: '3.7'

services:
  hub:
    image: selenium/hub:3.141.59-mercury
    ports:
      - "4444:4444"
    volumes:
      - /dev/shm:/dev/shm
    privileged: true
    environment:
      HUB_HOST: hub
      HUB_PORT: 4444
    deploy:
      resources:
        limits:
          memory: 5000M
      restart_policy:
        condition: on-failure
        window: 240s
    healthcheck:
      test: ["CMD", "curl", "-I", "http://127.0.0.1:4444/wd/hub/status"]
      interval: 1m
      timeout: 60s
      retries: 3
      start_period: 300s

  chrome:
    image: selenium/node-chrome:latest
    volumes:
      - /dev/shm:/dev/shm
    privileged: true
    environment:
      HUB_HOST: hub
      HUB_PORT: 4444
      NODE_MAX_INSTANCES: 5
      NODE_MAX_SESSION: 5
    deploy:
      resources:
        limits:
          memory: 2800M
      replicas: 10
    entrypoint: bash -c 'SE_OPTS="-host $$HOSTNAME" /opt/bin/entry_point.sh'

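The problem is that when the hub's status is unhealthy, Swarm almost never restarts it. I have seen it restart successfully only a few times. As far as I understand, it should keep restarting until the healthcheck succeeds, or forever, but the container just keeps running in the unhealthy state.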

I tried removing restart_policy entirely, in case it interferes with swarm mode, but that had no effect.
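
For a sense of how long the hub can stay unresponsive before it is even marked unhealthy, the healthcheck timings above can be added up. This is a rough sketch, assuming Docker's documented behaviour that each check starts one interval after the previous check completes and that a hung check runs for the full timeout; the function name is made up for illustration.

```shell
# Upper bound, in seconds, from the last successful check to the container
# being marked unhealthy, assuming every following check hangs for the full
# timeout. Failures during start_period are not counted towards retries.
max_seconds_to_unhealthy() {
    interval=$1 timeout=$2 retries=$3
    echo $(( retries * (interval + timeout) ))
}

# Values from the hub healthcheck: interval 1m, timeout 60s, retries 3.
max_seconds_to_unhealthy 60 60 3   # prints 360
```

With these settings a hung hub would be reported unhealthy up to about six minutes after it stops responding, and a freshly started container additionally gets the 300s start_period.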

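Also: it seems that the chrome containers (all replicas) restart after the hub has restarted successfully. No relationship between them is specified in docker-compose.yml, so how does this happen?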

What could be wrong with my setup?
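
One way to check whether Swarm ever replaced the hub or chrome tasks, and with what error, is the task history of each service. The sketch below is only a thin wrapper around docker service ps; the function name and the service names in the usage comment are assumptions (the real names depend on the stack name used with docker stack deploy).

```shell
# Show the task history of a service, including the error that ended each
# earlier task. Swarm names services <stack>_<service>; "selenium_hub" is a
# hypothetical example for a stack deployed as "selenium".
task_history() {
    docker service ps --no-trunc \
        --format 'table {{.Name}}\t{{.CurrentState}}\t{{.Error}}' "$1"
}

# Usage (run on a swarm manager node):
#   task_history selenium_hub
#   task_history selenium_chrome
```

If the hub task list shows only a single long-running task, Swarm never acted on the unhealthy state; if the chrome tasks show errors at the time of a hub restart, the Error column should hint at why they exited.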

Update:

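When I try to inspect the container (after it has become unhealthy and no more restart attempts are made), for example with docker container inspect $container_id --format '{{json .State.Health}}' | jq ., or run almost any other command against the container, it fails with the following output: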

docker container inspect 1abfa546cc26 --format '{{json .State.Health}}' | jq .
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fa114765fff m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7fa114765fff
stack: frame={sp:0x7ffe5e0f1a08, fp:0x0} stack=[0x7ffe5d8f2fc8,0x7ffe5e0f1ff0)
00007ffe5e0f1908:  73752f3a6e696273  732f3a6e69622f72 
00007ffe5e0f1918:  6e69622f3a6e6962  2a3a36333b30303d 
00007ffe5e0f1928:  3b30303d616b6d2e  33706d2e2a3a3633 
00007ffe5e0f1938:  2a3a36333b30303d  3b30303d63706d2e 
00007ffe5e0f1948:  67676f2e2a3a3633  2a3a36333b30303d 
00007ffe5e0f1958:  333b30303d61722e  3d7661772e2a3a36 
00007ffe5e0f1968:  2e2a3a36333b3030  333b30303d61676f 
00007ffe5e0f1978:  7375706f2e2a3a36  2a3a36333b30303d 
00007ffe5e0f1988:  3b30303d7870732e  0000000000000000 
00007ffe5e0f1998:  3a36333b30303d66  2a3a36333b30303d 
00007ffe5e0f19a8:  3b30303d616b6d2e  33706d2e2a3a3633 
00007ffe5e0f19b8:  2a3a36333b30303d  3b30303d63706d2e 
00007ffe5e0f19c8:  67676f2e2a3a3633  2a3a36333b30303d 
00007ffe5e0f19d8:  333b30303d61722e  3d7661772e2a3a36 
00007ffe5e0f19e8:  2e2a3a36333b3030  333b30303d61676f 
00007ffe5e0f19f8:  7375706f2e2a3a36  0000000000000002 
00007ffe5e0f1a08: <8000000000000006  fffffffe7fffffff 
00007ffe5e0f1a18:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a28:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a38:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a48:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a58:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a68:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a78:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a88:  ffffffffffffffff  00007fa114acd6e0 
00007ffe5e0f1a98:  00007fa11476742a  0000000000000020 
00007ffe5e0f1aa8:  0000000000000000  0000000000000000 
00007ffe5e0f1ab8:  0000000000000000  0000000000000000 
00007ffe5e0f1ac8:  0000000000000000  0000000000000000 
00007ffe5e0f1ad8:  0000000000000000  0000000000000000 
00007ffe5e0f1ae8:  0000000000000000  0000000000000000 
00007ffe5e0f1af8:  0000000000000000  0000000000000000 
runtime: unknown pc 0x7fa114765fff
stack: frame={sp:0x7ffe5e0f1a08, fp:0x0} stack=[0x7ffe5d8f2fc8,0x7ffe5e0f1ff0)
00007ffe5e0f1908:  73752f3a6e696273  732f3a6e69622f72 
00007ffe5e0f1918:  6e69622f3a6e6962  2a3a36333b30303d 
00007ffe5e0f1928:  3b30303d616b6d2e  33706d2e2a3a3633 
00007ffe5e0f1938:  2a3a36333b30303d  3b30303d63706d2e 
00007ffe5e0f1948:  67676f2e2a3a3633  2a3a36333b30303d 
00007ffe5e0f1958:  333b30303d61722e  3d7661772e2a3a36 
00007ffe5e0f1968:  2e2a3a36333b3030  333b30303d61676f 
00007ffe5e0f1978:  7375706f2e2a3a36  2a3a36333b30303d 
00007ffe5e0f1988:  3b30303d7870732e  0000000000000000 
00007ffe5e0f1998:  3a36333b30303d66  2a3a36333b30303d 
00007ffe5e0f19a8:  3b30303d616b6d2e  33706d2e2a3a3633 
00007ffe5e0f19b8:  2a3a36333b30303d  3b30303d63706d2e 
00007ffe5e0f19c8:  67676f2e2a3a3633  2a3a36333b30303d 
00007ffe5e0f19d8:  333b30303d61722e  3d7661772e2a3a36 
00007ffe5e0f19e8:  2e2a3a36333b3030  333b30303d61676f 
00007ffe5e0f19f8:  7375706f2e2a3a36  0000000000000002 
00007ffe5e0f1a08: <8000000000000006  fffffffe7fffffff 
00007ffe5e0f1a18:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a28:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a38:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a48:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a58:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a68:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a78:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a88:  ffffffffffffffff  00007fa114acd6e0 
00007ffe5e0f1a98:  00007fa11476742a  0000000000000020 
00007ffe5e0f1aa8:  0000000000000000  0000000000000000 
00007ffe5e0f1ab8:  0000000000000000  0000000000000000 
00007ffe5e0f1ac8:  0000000000000000  0000000000000000 
00007ffe5e0f1ad8:  0000000000000000  0000000000000000 
00007ffe5e0f1ae8:  0000000000000000  0000000000000000 
00007ffe5e0f1af8:  0000000000000000  0000000000000000 

goroutine 1 [running, locked to thread]:
runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:311 fp=0xc00009c720 sp=0xc00009c718 pc=0x565171ddf910
runtime.newproc(0x565100000000, 0x56517409ab70)
    /usr/local/go/src/runtime/proc.go:3243 +0x71 fp=0xc00009c768 sp=0xc00009c720 pc=0x565171dbdea1
runtime.init.5()
    /usr/local/go/src/runtime/proc.go:239 +0x37 fp=0xc00009c788 sp=0xc00009c768 pc=0x565171db6447
runtime.init()
    <autogenerated>:1 +0x6a fp=0xc00009c798 sp=0xc00009c788 pc=0x565171ddf5ba
runtime.main()
    /usr/local/go/src/runtime/proc.go:147 +0xc2 fp=0xc00009c7e0 sp=0xc00009c798 pc=0x565171db6132
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00009c7e8 sp=0xc00009c7e0 pc=0x565171de1a11

rax    0x0
rbx    0x6
rcx    0x7fa114765fff
rdx    0x0
rdi    0x2
rsi    0x7ffe5e0f1990
rbp    0x5651736b13d5
rsp    0x7ffe5e0f1a08
r8     0x0
r9     0x7ffe5e0f1990
r10    0x8
r11    0x246
r12    0x565175ae21a0
r13    0x11
r14    0x565173654be8
r15    0x0
rip    0x7fa114765fff
rflags 0x246
cs     0x33
fs     0x0
gs     0x0


To fix it, I tried applying this solution: https://success.docker.com/article/how-to-reserve-resource-temporously-unavailable-errors-due-to-tasksmax-setting

But it did not change anything, so I assume the cause is something else.
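
For reference, "pthread_create failed: Resource temporarily unavailable" means a new thread could not be created, which usually points at a thread or process limit being hit somewhere (systemd's TasksMax is only one of them). A minimal sketch for comparing the host-wide count with the kernel limit follows; the function name is made up, and the file arguments exist only so it can be tried on saved copies of the procfs files.

```shell
# Print "<used> <limit>": the number of kernel scheduling entities
# (processes and threads) that currently exist, and kernel.threads-max.
# Defaults are the standard Linux procfs locations.
thread_usage() {
    loadavg_file=${1:-/proc/loadavg}
    threads_max_file=${2:-/proc/sys/kernel/threads-max}
    # Field 4 of /proc/loadavg has the form "<runnable>/<total>".
    used=$(awk '{ split($4, parts, "/"); print parts[2] }' "$loadavg_file")
    limit=$(cat "$threads_max_file")
    echo "$used $limit"
}

# Usage on the affected host:
#   thread_usage
#   cat /proc/sys/kernel/pid_max
#   systemctl show docker --property=TasksMax
```

If the first number is close to the second, or to pid_max, something on the host is leaking processes or threads, and timed-out healthcheck commands that never exit would be one candidate to look at.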

journalctl -u docker is just full of this log line:

 level=warning msg="Health check for container c427cfd49214d394cee8dd2c9019f6f319bc6637cfb53f0c14de70e1147b5fa6 error: context deadline exceeded"

nis*_*yal 0

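For the first question: Swarm is expected to restart an unhealthy container once the configured number of retries have failed. If you want to dig deeper into this, monitor the Docker events with the following command: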

docker events --filter event=health_status
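
A slightly broader variant of the command above may be easier to read, because it shows the health results next to the container lifecycle events that should follow them. This is a sketch with a made-up function name; it relies on docker combining repeated filters with the same key as a logical OR.

```shell
# Stream health-check results together with container lifecycle events, so
# an "unhealthy" result can be matched against any kill/die/start that
# follows it. One JSON object is printed per event.
watch_health_and_restarts() {
    docker events \
        --filter type=container \
        --filter event=health_status \
        --filter event=kill \
        --filter event=die \
        --filter event=start \
        --format '{{json .}}'
}

# Usage: watch_health_and_restarts
```

Container events are local to each daemon, so this needs to run on the node where the hub task is scheduled. An unhealthy health_status with no kill or die after it would confirm that the restart is never attempted.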

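For the second question: it is expected that all nodes restart whenever the hub restarts, because the hub maintains sessions with all nodes; when you restart the hub, it resets all sessions and sets up the nodes anew.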