Python 的多进程池中是否存在一个错误,导致失效进程处于 spawn 模式?

Vik*_*tor 5 python multiprocessing python-3.x python-multiprocessing

我们注意到在我们的一个部署中遗留了一堆已失效的(僵尸)进程,并设法生成了一个显示问题的非常小的程序:

多.py

from multiprocessing import Pool, set_start_method

def f(x):
    return x*x

if __name__ == '__main__':
    set_start_method('spawn')
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
        p.close()
        p.join()
Run Code Online (Sandbox Code Playgroud)

该程序似乎正在离开僵尸进程,但很难捕获,因为从常规 shell 运行它会导致 shell 收割僵尸进程。

在我们的部署中,我们从另一个 python 程序运行它,所以为了模拟它,我们有这个:

主要.py

from subprocess import run
from time import sleep

while True:
    result = run(["python", "multi.py"], capture_output=True)
    print(result.stdout.decode('utf-8'))
    result = run(["ps", "-ef", "--forest"], capture_output=True)
    print(result.stdout.decode('utf-8'), flush=True)
    sleep(1)
Run Code Online (Sandbox Code Playgroud)

运行main.py会产生以下输出:

[1, 4, 9]

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0 11 11:33 pts/0    00:00:00 python main.py
root         8     1  0 11:33 pts/0    00:00:00 [python] <defunct>
root        17     1  0 11:33 pts/0    00:00:00 ps -ef --forest

[1, 4, 9]

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  6 11:33 pts/0    00:00:00 python main.py
root         8     1  3 11:33 pts/0    00:00:00 [python] <defunct>
root        19     1  0 11:33 pts/0    00:00:00 [python] <defunct>
root        28     1  0 11:33 pts/0    00:00:00 ps -ef --forest

[1, 4, 9]

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  4 11:33 pts/0    00:00:00 python main.py
root         8     1  1 11:33 pts/0    00:00:00 [python] <defunct>
root        19     1  3 11:33 pts/0    00:00:00 [python] <defunct>
root        30     1  0 11:33 pts/0    00:00:00 [python] <defunct>
root        39     1  0 11:33 pts/0    00:00:00 ps -ef --forest

[1, 4, 9]

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  3 11:33 pts/0    00:00:00 python main.py
root         8     1  1 11:33 pts/0    00:00:00 [python] <defunct>
root        19     1  1 11:33 pts/0    00:00:00 [python] <defunct>
root        30     1  4 11:33 pts/0    00:00:00 [python] <defunct>
root        41     1  0 11:33 pts/0    00:00:00 [python] <defunct>
root        50     1  0 11:33 pts/0    00:00:00 ps -ef --forest
Run Code Online (Sandbox Code Playgroud)

另一方面,以下程序不会产生失效的进程:

主 sig.py

[1, 4, 9]

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0 11 11:33 pts/0    00:00:00 python main.py
root         8     1  0 11:33 pts/0    00:00:00 [python] <defunct>
root        17     1  0 11:33 pts/0    00:00:00 ps -ef --forest

[1, 4, 9]

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  6 11:33 pts/0    00:00:00 python main.py
root         8     1  3 11:33 pts/0    00:00:00 [python] <defunct>
root        19     1  0 11:33 pts/0    00:00:00 [python] <defunct>
root        28     1  0 11:33 pts/0    00:00:00 ps -ef --forest

[1, 4, 9]

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  4 11:33 pts/0    00:00:00 python main.py
root         8     1  1 11:33 pts/0    00:00:00 [python] <defunct>
root        19     1  3 11:33 pts/0    00:00:00 [python] <defunct>
root        30     1  0 11:33 pts/0    00:00:00 [python] <defunct>
root        39     1  0 11:33 pts/0    00:00:00 ps -ef --forest

[1, 4, 9]

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  3 11:33 pts/0    00:00:00 python main.py
root         8     1  1 11:33 pts/0    00:00:00 [python] <defunct>
root        19     1  1 11:33 pts/0    00:00:00 [python] <defunct>
root        30     1  4 11:33 pts/0    00:00:00 [python] <defunct>
root        41     1  0 11:33 pts/0    00:00:00 [python] <defunct>
root        50     1  0 11:33 pts/0    00:00:00 ps -ef --forest
Run Code Online (Sandbox Code Playgroud)

此外,以下简单的 shell 脚本deo 不会产生僵尸:

from os import wait
import signal
from subprocess import run
from time import sleep

def chld_handler(_signum, _frame):
    wait()

signal.signal(signal.SIGCHLD, chld_handler)

while True:
    result = run(["python", "multi.py"], capture_output=True)
    print(result.stdout.decode('utf-8'))
    result = run(["ps", "-ef", "--forest"], capture_output=True)
    print(result.stdout.decode('utf-8'), flush=True)
    sleep(1)
Run Code Online (Sandbox Code Playgroud)

这是 Python 中的错误还是您需要处理来自子进程的任何僵尸(就像 Bash 似乎在做的那样)?

Dockerfile可在此处获得所有代码和轻松重现问题的代码:https : //github.com/viktorvia/python-multi-issue

该问题可在 Python 3.9.6、3.7.4 和 3.7.11 中重现。