Why does interning global string values result in less memory used per multiprocessing process?

Question

Rob*_*ing 6 python linux multiprocessing

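I have a Python 3.6 data processing task that involves pre-loading a large dict for looking up dates by ID, for use in a subsequent step by a pool of sub-processes managed by the multiprocessing module. This process was eating up most of the memory on the box, so one optimisation I applied was to "intern" the date strings stored in the dict. As I expected, this reduced the memory footprint of the dict by several GB, but it also had another, unexpected effect.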
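As a rough illustration of why interning shrinks the dict, the sketch below builds a small dict of repeated date strings with and without sys.intern and counts how many distinct string objects the values refer to. The helper names (build_map, distinct_value_objects) and the sample date format are assumptions made for the example and do not come from the original script.

```python
import sys


def build_map(n, intern):
    """Build a dict of n entries whose values repeat over 28 distinct dates."""
    result = {}
    for i in range(n):
        # Built at runtime, so every iteration creates a fresh str object.
        date = "2021-05-" + str(i % 28 + 1).zfill(2)
        result[str(i)] = sys.intern(date) if intern else date
    return result


def distinct_value_objects(mapping):
    """Count the distinct objects (by identity) referenced as values."""
    return len({id(value) for value in mapping.values()})


plain = build_map(1000, intern=False)
interned = build_map(1000, intern=True)
print(distinct_value_objects(plain), distinct_value_objects(interned))
```

Without interning, every value is its own object even when the text is identical; with interning, equal values collapse to one shared object per distinct date.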

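Before applying interning, the sub-processes would gradually eat more and more memory as they executed. I believe this was down to them having to copy the dict gradually from global memory into each sub-process's individually allocated memory (this is running on Linux and so benefits from the copy-on-write behaviour of fork()). Even though I'm not updating the dict in the sub-processes, it looks like read-only access can still trigger copy-on-write through reference counting.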
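A minimal sketch of the reference counting effect mentioned above, assuming CPython: merely looking a value up and binding it to a name changes the reference count stored in the object's header, which is a write to the memory page holding that object. The function name refcount_delta and the sample key are hypothetical.

```python
import sys


def refcount_delta():
    """Return how much a plain dict lookup changes a value's refcount."""
    # Build the string at runtime so it is an ordinary heap object.
    value = "".join(["2021", "-05-15"])
    table = {"42": value}

    before = sys.getrefcount(value)
    x = table["42"]  # read-only access, nothing in the dict is modified
    after = sys.getrefcount(value)
    return after - before


print(refcount_delta())
```

In a forked child, a write like this is enough for the kernel to give the child its own private copy of the page the object lives on.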

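I only expected interning to reduce the memory footprint of the dict, but in fact it also stopped the memory usage from gradually increasing over the lifetime of the sub-processes.

Here is a minimal example I was able to build that replicates the behaviour, although it requires a large file to load and populate the dict with, and a sufficient amount of repetition in the values to make sure that interning provides a benefit.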

import multiprocessing
import sys

# initialise a large dict that will be visible to all processes
# that contains a lot of repeated values
global_map = dict()
with open(sys.argv[1], 'r', encoding='utf-8') as file:
  if len(sys.argv) > 2:
    print('interning is on')
  else:
    print('interning is off')
  for i, line in enumerate(file):
    if i > 30000000:
      break
    parts = line.split('|')
    if len(sys.argv) > 2:
      global_map[str(i)] = sys.intern(parts[2])
    else:
      global_map[str(i)] = parts[2]

def read_map():
  # do some nonsense processing with each value in the dict
  global global_map
  for i in range(30000000):
    x = global_map[str(i)]
  y = x + '_'
  return y

print("starting processes")
process_pool = multiprocessing.Pool(processes=10)

for _ in range(10):
  process_pool.apply_async(read_map)

process_pool.close()

process_pool.join()

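I ran this script and monitored htop to see the total memory usage.

Interning?	Memory usage after printing "starting processes"	Peak memory usage afterwards
No	7.1GB	28.0GB
Yes	5.5GB	5.6GB

While I am delighted that this optimisation seems to have fixed all my memory issues at once, I'd like to understand better why it works. If the growth in memory usage by the sub-processes is down to copy-on-write, why doesn't it happen when I intern the strings?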

Answer 1

HTF*_*HTF 4

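The CPython implementation stores interned strings in a global object that is a regular Python dictionary, where both keys and values are pointers to the string objects.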

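When a new child process is created, it gets a copy of the parent's address space, so the children use the reduced data dictionary together with the interned strings.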
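The inheritance described above can also be observed on an unpatched interpreter. The sketch below is an illustration under stated assumptions: it uses the "fork" start method (so it only runs on platforms that support it, such as Linux), and the helper names and sample string are hypothetical. The parent interns a string, then the child rebuilds an equal string and asks sys.intern for the canonical object; if the child inherited the parent's intern table, it gets back the object at the same address the parent recorded.

```python
import multiprocessing as mp
import sys


def _check(expected_id, parts, conn):
    # Rebuild an equal string at runtime, then look it up in the intern table.
    candidate = "".join(parts)
    conn.send(id(sys.intern(candidate)) == expected_id)
    conn.close()


def child_sees_parent_intern():
    """Return True if a forked child finds the parent's interned string."""
    ctx = mp.get_context("fork")  # assumption: fork is available
    parts = ["2021-05-15", "|", "example"]
    canonical = sys.intern("".join(parts))  # kept alive until the child ends

    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_check, args=(id(canonical), parts, send_end))
    proc.start()
    result = recv_end.recv()
    proc.join()
    return result


print(child_sees_parent_intern())
```

This only shows that the table is visible in the child; it does not measure how many memory pages end up being copied.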

I've compiled Python with the patch below and, as you can see, both processes have access to the table of interned strings:

test.py:

import multiprocessing as mp
import sys
import _string


PROCS = 2
STRING = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"


def worker():
    proc = mp.current_process()
    interned = _string.interned()

    try:
        idx = interned.index(STRING)
    except ValueError:
        s = None
    else:
        s = interned[idx]

    print(f"{proc}: <{s}>")


def main():
    sys.intern(STRING)

    procs = []

    for _ in range(PROCS):
        p = mp.Process(target=worker)
        p.start()
        procs.append(p)

    for p in procs:
        p.join()


if __name__ == "__main__":
    main()

Test:

# python test.py 
<Process name='Process-1' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>
<Process name='Process-2' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>

Patch:

--- Objects/unicodeobject.c 2021-05-15 15:08:05.117433926 +0100
+++ Objects/unicodeobject.c.tmp 2021-05-15 23:48:35.236152366 +0100
@@ -16230,6 +16230,11 @@
     _PyUnicode_FiniEncodings(&tstate->interp->unicode.fs_codec);
 }
 
+static PyObject *
+interned_impl(PyObject *module)
+{
+    return PyDict_Values(interned);
+}
 
 /* A _string module, to export formatter_parser and formatter_field_name_split
    to the string.Formatter class implemented in Python. */
@@ -16239,6 +16244,8 @@
      METH_O, PyDoc_STR("split the argument as a field name")},
     {"formatter_parser", (PyCFunction) formatter_parser,
      METH_O, PyDoc_STR("parse the argument as a format string")},
+    {"interned", (PyCFunction) interned_impl,
+     METH_NOARGS, PyDoc_STR("lookup interned strings")},
     {NULL, NULL}
 };

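You may also want to take a look at the shared_memory module (multiprocessing.shared_memory, available from Python 3.8).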
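A minimal sketch of what that could look like, assuming Python 3.8 or later and assuming the dates are fixed-width "YYYY-MM-DD" strings that can be indexed by an integer ID. The function names publish and lookup and the record layout are made up for this example; a raw byte buffer carries no per-object reference counts, so reading it does not dirty the pages the way reading Python objects does.

```python
from multiprocessing import shared_memory

RECORD = 10  # assumed fixed width of one "YYYY-MM-DD" record, in bytes


def publish(dates):
    """Copy a non-empty list of dates into a new shared memory block."""
    shm = shared_memory.SharedMemory(create=True, size=RECORD * len(dates))
    for i, date in enumerate(dates):
        raw = date.encode("ascii")
        if len(raw) != RECORD:
            raise ValueError(f"unexpected date format: {date!r}")
        shm.buf[i * RECORD:(i + 1) * RECORD] = raw
    return shm


def lookup(name, index):
    """Attach to the block by name and read one record back as a str."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[index * RECORD:(index + 1) * RECORD]).decode("ascii")
    finally:
        shm.close()


block = publish(["2021-05-15", "2020-01-01"])
print(lookup(block.name, 1))
block.close()
block.unlink()  # release the block once no process needs it any more
```

In a real pool, each worker would attach once by name and keep the handle open rather than reattaching on every lookup.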

References:

The internals of Python string interning
