Troubleshooting docker Exit 137 when "OOMKilled": false

Dan*_*nny 5 python linux docker vcenter docker-compose

What I did

  1. Started the services with `docker-compose up` on an AlmaLinux server.
  2. Noticed that the output of `docker-compose logs` had not changed for a while.
  3. Checked `docker-compose ps`:

        $ docker-compose ps
                      Name                            Command                State     Ports
        ------------------------------------------------------------------------------------
        mysupercoolsystem_api_1           python -m mysupercoolsyste ...   Exit 137
        mysupercoolsystem_dev_1           sh -c jupyter lab --ip=0.0 ...   Exit 137
        mysupercoolsystem_loader_1        /bin/sh -c python -m mysup ...   Exit 137
        mysupercoolsystem_predictor_1     /bin/sh -c python -m mysup ...   Exit 137
        mysupercoolsystem_trainer_1       /bin/sh -c python -m mysup ...   Exit 137

        $ docker ps -a  # just to confirm
        72708f3450   hub.nic.dk/nicecompany/mysupercoolsystem   "/bin/sh -c 'python …"   2 days ago    Exited (137) 2 days ago              mysupercoolsystem_trainer_1
        3e286cabb0   jupyter/scipy-notebook:33add21fab64        "sh -c 'jupyter lab …"   2 days ago    Exited (137) 2 days ago              mysupercoolsystem_dev_1
        246b87f0ac   hub.nic.dk/nicecompany/mysupercoolsystem   "/bin/sh -c 'python …"   2 days ago    Exited (137) 2 days ago              mysupercoolsystem_predictor_1
        7d3297092c   hub.nic.dk/nicecompany/mysupercoolsystem   "python -m mysuperc …"   2 days ago    Exited (137) 2 days ago              mysupercoolsystem_api_1
        2a07851f9c   hub.nic.dk/nicecompany/mysupercoolsystem   "/bin/sh -c 'python …"   2 days ago    Exited (137) 2 days ago              mysupercoolsystem_loader_1

  4. Investigated whether the containers were stopped because they ran out of memory:

    • Checked the virtual host: the docker containers run on a single virtual (vcenter-managed) host. The host is allocated 20GB of RAM, and the vcenter monitor shows RAM usage peaking at approx. 8GB, never more.
    • Follow-up, after talking to the sysadmin: the server was not rebooted, and nobody explicitly asked for any process to be killed.
    • `docker info | grep Memory` reports `Total Memory: 19.37GiB`.
    • `docker inspect <container_id>` gives the same values for every container, except for the `"FinishedAt"` entry of the `"State"` field, which varies by ±0.05 seconds:

        "State": {
          "Status": "exited",
          "Running": false,
          "Paused": false,
          "Restarting": false,
          "OOMKilled": false,
          "Dead": false,
          "Pid": 0,
          "ExitCode": 137,
          "Error": "",
          "StartedAt": "2021-11-13T10:33:04.785566471Z",
          "FinishedAt": "2021-11-13T10:33:57.1xxxxZ"
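
For context on the status code: by convention, an exit status above 128 means the process was terminated by the signal numbered status minus 128, so 137 corresponds to signal 9 (SIGKILL). The sketch below decodes an exit code that way and pulls the relevant `"State"` fields out of `docker inspect` output. The helper names (`decode_exit_code`, `summarize_state`, `inspect_containers`) are made up for illustration, and the code assumes `docker inspect` prints a JSON array with one object per container.

```python
import json
import signal
import subprocess


def decode_exit_code(code: int) -> str:
    """Name the signal behind an exit code, using the 128+N convention."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return f"unknown signal {code - 128}"
    return "no signal (plain exit status)"


def summarize_state(inspect_json: str) -> list:
    """Collect the State fields that matter for an unexpected exit."""
    summaries = []
    for container in json.loads(inspect_json):
        state = container["State"]
        summaries.append({
            "name": container.get("Name", "?"),
            "exit_code": state["ExitCode"],
            "signal": decode_exit_code(state["ExitCode"]),
            "oom_killed": state["OOMKilled"],
            "finished_at": state["FinishedAt"],
        })
    return summaries


def inspect_containers(container_ids: list) -> list:
    """Run `docker inspect` on the given IDs and summarize each container."""
    result = subprocess.run(
        ["docker", "inspect", *container_ids],
        capture_output=True, text=True, check=True,
    )
    return summarize_state(result.stdout)
```

Calling `inspect_containers(["72708f3450", "3e286cabb0"])` would return one summary per container. A SIGKILL combined with `"OOMKilled": false` only shows that signal 9 reached the process; it does not by itself identify what sent it.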

  5. Re-checked my `docker-compose.yml`:

        $ cat docker-compose.yml
        version: "3"
        services:
          dev:
            image: jupyter/scipy-notebook:33add21fab64
            environment:
              - COMPONENT=develop
            volumes:
              - /opt/mysupercoolsystem:/home/jovyan/work
              - /media:/media
            ports:
              - "3333:3333"
            entrypoint: sh -c "jupyter lab --ip=0.0.0.0 --port=3333 --no-browser --allow-root"

          loader:
            image: hub.nic.com/nicecompany/mysupercoolsystem
            working_dir: "/app"
            volumes:
              - /media:/media

          trainer:
            image: hub.nic.dk/nicecompany/mysupercoolsystem
            environment:
              - COMPONENT=train
            working_dir: "/app"
            volumes:
              - models:/models

          predictor:
            image: hub.nic.dk/nicecompany/mysupercoolsystem
            environment:
              - COMPONENT=pred
            working_dir: "/app"
            volumes:
              - models:/models

          api:
            image: hub.nic.dk/nicecompany/mysupercoolsystem
            environment:
              - COMPONENT=api
            working_dir: "/app"
            ports:
              - "69:69"
            entrypoint: python -m mysupercoolsystem.web_api

        volumes:
          models:

  6. Checked the `Dockerfile`. Note: services without an explicit entrypoint in `docker-compose.yml` use the one from the `Dockerfile`:

        $ cat mysupercoolsystem/Dockerfile
        FROM python:3.8
        WORKDIR /app
        COPY ./requirements.txt /app/requirements.txt
        RUN pip install -r requirements.txt
        COPY . /app
        RUN pip install .
        ENTRYPOINT python -m mysupercoolsystem

  7. Checked similar questions (in one of them the culprit was the `--abort-on-container-exit` flag; I am not using any flags).

How to proceed

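  • Why are the services exiting?
  • How do I resolve the error?
  • Are there other logs I should check?
  • If I add `restart: unless-stopped` to every service, is there any way to check for docker service exits, other than my own logging in `docker logs`?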
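
As one possible way to record container exits independently of application logging, the sketch below reads `docker events` filtered to `die` events and extracts the container name and exit code. It assumes the JSON layout emitted by `docker events --format '{{json .}}'`, with an `Actor.Attributes` object holding `name` and `exitCode` keys; that layout should be verified against the installed Docker version, and the function names are illustrative.

```python
import json
import subprocess
from typing import Iterator


def parse_die_event(line: str) -> dict:
    """Pull the container name, exit code and timestamp out of one JSON event line."""
    event = json.loads(line)
    attrs = event.get("Actor", {}).get("Attributes", {})
    return {
        "name": attrs.get("name"),
        # The exit code is assumed to arrive as a string; convert it when present.
        "exit_code": int(attrs["exitCode"]) if "exitCode" in attrs else None,
        "time": event.get("time"),
    }


def watch_die_events() -> Iterator[dict]:
    """Yield one parsed record per container exit; blocks while docker events runs."""
    cmd = ["docker", "events", "--filter", "event=die", "--format", "{{json .}}"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            if line.strip():
                yield parse_die_event(line)
```

A small supervisor script could loop over `watch_die_events()` and append each record to a file, so that exits are still recorded when a restart policy brings the container back up.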

Ita*_*ing 0

You can debug out-of-memory errors in Python with https://pythonspeed.com/fil/ (see https://pythonspeed.com/articles/crash-out-of-memory/).
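
As a coarse complement to a profiler (a different technique, not Fil itself), the sketch below logs the process's peak resident memory from inside a long-running loop using the standard-library `resource` module. It assumes Linux, where `ru_maxrss` is reported in kilobytes; `peak_rss_mib` and `run_loop` are illustrative names and not part of any program discussed here.

```python
import logging
import resource
import time


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (Linux reports ru_maxrss in KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def run_loop(work, iterations: int, sleep_seconds: float = 1.0) -> None:
    """Call `work` repeatedly and log the peak memory seen so far after each pass."""
    for i in range(iterations):
        work()
        logging.info("iteration=%d peak_rss_mib=%.1f", i, peak_rss_mib())
        time.sleep(sleep_seconds)
```

If the logged peak stays well below the host's RAM right up to the exit, that would be consistent with memory not being the cause.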

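  • The Docker containers run on a single virtual (vcenter-managed) host. The host is allocated 20GB of RAM, and vcenter monitoring shows RAM usage peaking at 8GB, never more. The logs also indicate that OOM is not the issue: `"OOMKilled": false`. Also, the program is not supposed to exit; it runs in a while-true-do-sleep loop. (2 upvotes)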