Has I/O become slower since Python 2.7?

Question

Mar*_*oma 6 python performance python-2.7 python-3.6 python-3.8

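I currently have a small side project in which I want to sort a 20GB file on my machine as fast as possible. The idea is to chunk the file, sort the chunks, and merge the chunks. I just used pyenv to time the radix sort code with different Python versions and saw that 2.7.18 is way faster than 3.6.10, 3.7.7, 3.8.3 and 3.9.0a. Can anybody explain why Python 3.x is slower than 2.7.18 in this simple example? Were new features added?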

import os


def chunk_data(filepath, prefixes):
    """
    Pre-sort and chunk the content of filepath according to the prefixes.

    Parameters
    ----------
    filepath : str
        Path to a text file which should get sorted. Each line contains
        a string which has at least 2 characters and the first two
        characters are guaranteed to be in prefixes
    prefixes : List[str]
    """
    prefix2file = {}
    for prefix in prefixes:
        chunk = os.path.abspath("radixsort_tmp/{}.txt".format(prefix))
        prefix2file[prefix] = open(chunk, "w")

    try:
        # This is where most of the execution time is spent:
        with open(filepath) as fp:
            for line in fp:
                prefix2file[line[:2]].write(line)
    finally:
        # Close (and thereby flush) every chunk file.
        for chunk_file in prefix2file.values():
            chunk_file.close()

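Execution times (multiple runs):

2.7.18: 192.2s, 220.3s, 225.8s
3.6.10: 302.5s
3.7.7: 308.5s
3.8.3: 279.8s, 279.7s (binary mode), 295.3s (binary mode), 307.7s, 380.6s (wtf?)
3.9.0a: 292.6s

The complete code is on GitHub, along with a minimal complete version.

Unicode

Yes, I know that Python 3 and Python 2 handle strings differently. I tried opening the files in binary mode (rb / wb); see the "binary mode" annotations above. Those runs are a tiny bit faster in a couple of cases. Still, Python 2.7 is way faster on all runs.

Try 1: Dictionary access

When I first wrote this question, I thought that dictionary access might be the reason for the difference. However, I think the total execution time spent on dictionary access is far smaller than the time spent on I/O. Also, timeit did not show anything significant: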

import timeit
import numpy as np

durations = timeit.repeat(
    'a["b"]',
    repeat=10 ** 6,
    number=1,
    setup="a = {'b': 3, 'c': 4, 'd': 5}"
)

mul = 10 ** -7

print(
    "mean = {:0.1f} * 10^-7, std={:0.1f} * 10^-7".format(
        np.mean(durations) / mul,
        np.std(durations) / mul
    )
)
print("min  = {:0.1f} * 10^-7".format(np.min(durations) / mul))
print("max  = {:0.1f} * 10^-7".format(np.max(durations) / mul))

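Try 2: Copy time

As a simplified experiment, I tried copying the 20GB file:

cp via shell: 230s
Python 2.7.18: 237s, 249s
Python 3.8.3: 233s, 267s, 272s

The Python timings were generated by the code below.

My first thought was that the variance is quite high, so that could be the reason. But the variance of the chunk_data execution time is also high, and yet its mean is noticeably lower for Python 2.7 than for Python 3.x. So the problem does not seem to be reproducible with an I/O scenario as simple as the one I tried here.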

import time
import sys
import os


version = sys.version_info
version = "{}.{}.{}".format(version.major, version.minor, version.micro)


if os.path.isfile("numbers-tmp.txt"):
    os.remove("numbers-tmp.txt")

t0 = time.time()
with open("numbers-large.txt") as fin, open("numbers-tmp.txt", "w") as fout:
    for line in fin:
        fout.write(line)
t1 = time.time()


print("Python {}: {:0.0f}s".format(version, t1 - t0))

My system

Ubuntu 20.04
Thinkpad T460p
Python via pyenv

Answer 1

a_g*_*est 11

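This is a combination of several effects, mostly the fact that Python 3 needs to perform unicode decoding/encoding when working in text mode, and that in binary mode it sends the data through dedicated buffered IO implementations.

First of all, measuring execution time with time.time uses wall time and therefore includes all sorts of things unrelated to Python, such as OS-level caching and buffering, as well as buffering in the storage medium. It also reflects any interference from other processes that need the storage medium. That is why you see these wild variations in the timing results. Here are the results on my system, from seven consecutive runs for each version: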

py3 = [660.9, 659.9, 644.5, 639.5, 752.4, 648.7, 626.6]  # 661.79 +/- 38.58
py2 = [635.3, 623.4, 612.4, 589.6, 633.1, 613.7, 603.4]  # 615.84 +/- 15.09

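Despite the large variation, these results do seem to indicate different timings, which can be confirmed, for example, with a statistical test: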

>>> from scipy.stats import ttest_ind
>>> ttest_ind(py2, py3)[1]
0.018729004515179636

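That is, if both sets of timings came from distributions with the same mean, a difference at least this large would be observed only about 2% of the time.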
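The summary figures in the comments next to the timings can be reproduced from the raw numbers. Below is a minimal sketch using only the standard library. It assumes that the `+/-` values are population standard deviations (what `numpy.std` computes by default), which the answer does not state explicitly; the helper name `summarize` is made up for illustration.

```python
import statistics

# Wall-clock timings in seconds, copied from the answer above.
py3 = [660.9, 659.9, 644.5, 639.5, 752.4, 648.7, 626.6]
py2 = [635.3, 623.4, 612.4, 589.6, 633.1, 613.7, 603.4]


def summarize(samples):
    """Return (mean, population standard deviation) of the samples."""
    return statistics.mean(samples), statistics.pstdev(samples)


for label, samples in (("py3", py3), ("py2", py2)):
    mean, std = summarize(samples)
    print("{}: {:.2f} +/- {:.2f}".format(label, mean, std))
```

With only seven runs per version and one clear outlier in the Python 3 data, the spread is dominated by a single measurement, which is one more reason to look at process time next.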

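We can get a more precise picture by measuring process time instead of wall time. In Python 2 this can be done via time.clock (on Unix, where it reports CPU time), while Python 3.3+ offers time.process_time. These two functions report the following timings: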

py3_process_time = [224.4, 226.2, 224.0, 226.0, 226.2, 223.7, 223.8]  # 224.90 +/- 1.09
py2_process_time = [171.0, 171.1, 171.2, 171.3, 170.9, 171.2, 171.4]  # 171.16 +/- 0.16

Now the spread in the data is much smaller, since the timings reflect only the Python process.
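To make the distinction concrete, here is a small sketch that records wall time and process time for the same call on Python 3.3+. The helper name `measure` is hypothetical and not part of the original test script.

```python
import time


def measure(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


if __name__ == "__main__":
    # time.sleep waits without using the CPU, so the wall time is about
    # 0.2 s while the process time stays close to zero.
    _, wall, cpu = measure(time.sleep, 0.2)
    print("wall = {:.3f}s, cpu = {:.3f}s".format(wall, cpu))
```

Time spent waiting for the disk behaves much like the sleep here: it shows up in the wall time but not in the process time.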

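These data suggest that Python 3 takes about 53.7 seconds longer to execute. Given the large number of lines in the input file (550_000_000), this amounts to about 97.7 nanoseconds more per iteration.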
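The arithmetic behind these two figures, as a short sketch. The timings and the line count are copied from the answer; the variable names are chosen here for illustration.

```python
# Process-time measurements in seconds, copied from the answer above.
py3_process_time = [224.4, 226.2, 224.0, 226.0, 226.2, 223.7, 223.8]
py2_process_time = [171.0, 171.1, 171.2, 171.3, 170.9, 171.2, 171.4]
NUM_LINES = 550_000_000  # number of lines in the input file


def mean(values):
    return sum(values) / len(values)


# Difference of the mean run times, then spread over all loop iterations.
extra_seconds = mean(py3_process_time) - mean(py2_process_time)
extra_ns_per_line = extra_seconds / NUM_LINES * 1e9

print("extra time: {:.1f} s".format(extra_seconds))
print("per line:   {:.1f} ns".format(extra_ns_per_line))
```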

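The first effect that increases execution time is unicode strings in Python 3. The binary data is read from the file, decoded, and then encoded again when it is written back. In Python 2 all strings are stored as binary strings right away, so there is no encoding/decoding overhead. You don't see this effect clearly in your tests because it disappears in the large variation introduced by various external resources, which is reflected in the wall time differences. For example, we can measure the time it takes to do a roundtrip from binary to unicode and back to binary: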

In [1]: %timeit b'000000000000000000000000000000000000'.decode().encode()                     
162 ns ± 2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

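This does include two attribute lookups as well as two function calls, so the actual time needed is smaller than the value reported above. To see the impact on execution time, we can change the test script to use the binary modes "rb" and "wb" instead of the text modes "r" and "w". This reduces the timing results for Python 3 as follows: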

py3_binary_mode = [200.6, 203.0, 207.2]  # 203.60 +/- 2.73

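This reduces the process time by about 21.3 seconds, or 38.7 nanoseconds per iteration. That agrees with the timing of the roundtrip benchmark minus the timing of the name lookups and function calls: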

In [2]: class C: 
   ...:     def f(self): pass 
   ...:                                                                                       

In [3]: x = C()                                                                               

In [4]: %timeit x.f()                                                                         
82.2 ns ± 0.882 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [5]: %timeit x                                                                             
17.8 ns ± 0.0564 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

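Here %timeit x measures the extra overhead of resolving the global name x, so the attribute lookup and function call account for 82.2 - 17.8 == 64.4 nanoseconds. Subtracting this overhead twice from the roundtrip figure above gives 162 - 2*64.4 == 33.2 nanoseconds.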
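The same estimate can be scripted outside IPython with the standard timeit module. This is only a sketch under stated assumptions: `best_ns`, `NUMBER` and `REPEAT` are names and values picked here for illustration, the `globals` argument requires Python 3.5+, and the absolute numbers will differ from machine to machine.

```python
import timeit

NUMBER = 200_000  # loops per measurement (illustrative choice)
REPEAT = 5        # measurements per statement (illustrative choice)


class C:
    def f(self):
        pass


# Names that the timed statements resolve as globals, as in the IPython session.
namespace = {"x": C()}


def best_ns(stmt):
    """Return the best time per loop for stmt, in nanoseconds."""
    timings = timeit.repeat(stmt, repeat=REPEAT, number=NUMBER, globals=namespace)
    return min(timings) / NUMBER * 1e9


roundtrip_ns = best_ns("b'" + "0" * 36 + "'.decode().encode()")
call_ns = best_ns("x.f()")  # global lookup + attribute lookup + call
name_ns = best_ns("x")      # global lookup only
lookup_and_call_ns = call_ns - name_ns

print("roundtrip:             {:.1f} ns".format(roundtrip_ns))
print("lookup + call:         {:.1f} ns".format(lookup_and_call_ns))
print("decode/encode (est.):  {:.1f} ns".format(roundtrip_ns - 2 * lookup_and_call_ns))
```

Calling an empty Python-level method is not the same as calling a built-in method such as bytes.decode, so the subtraction gives a rough estimate rather than an exact cost.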

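Now there is still a difference of 32.4 seconds between Python 3 in binary mode and Python 2. This is because in Python 3 all IO goes through the (quite complex) implementation of io.BufferedWriter.write, while in Python 2 the file.write method proceeds fairly directly to fwrite.

We can check the types of the file objects in both implementations: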

$ python3.8
>>> type(open('/tmp/test', 'wb'))
<class '_io.BufferedWriter'>

$ python2.7
>>> type(open('/tmp/test', 'wb'))
<type 'file'>

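Note also that the Python 2 timing results above were obtained in text mode, not binary mode. Binary mode aims to support all objects that implement the buffer protocol, which results in additional work being performed for strings as well (see also this question). If we switch to binary mode for Python 2 too, we obtain: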

py2_binary_mode = [212.9, 213.9, 214.3]  # 213.70 +/- 0.59

This is actually a bit larger than the Python 3 binary-mode result (by about 18.4 ns per iteration).
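The answer does not show the modified test script. A binary-mode variant of the copy loop, timed with process time, might look like the sketch below for Python 3.5+. The function name `copy_lines_binary` and the file names are placeholders, and the small generated input stands in for the 20GB file.

```python
import os
import tempfile
import time


def copy_lines_binary(src_path, dst_path):
    """Copy src_path to dst_path line by line in binary mode.

    Returns the CPU time spent, in seconds.
    """
    start = time.process_time()
    with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        for line in fin:
            fout.write(line)
    return time.process_time() - start


if __name__ == "__main__":
    # Small generated input; these file names are placeholders, not the
    # 20GB file from the question.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "numbers-small.txt")
        dst = os.path.join(tmp, "numbers-copy.txt")
        with open(src, "wb") as f:
            f.writelines(b"%d\n" % i for i in range(100000))
        print("CPU time: {:.3f}s".format(copy_lines_binary(src, dst)))
```

In binary mode the loop handles bytes objects throughout, so no decode or encode step runs between reading and writing.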

The two implementations also differ in other details, such as the dict implementation. To measure this effect we can create a corresponding setup:

from __future__ import print_function

import timeit

N = 10**6
R = 7
results = timeit.repeat(
    "d[b'10'].write",
    setup="d = dict.fromkeys((str(i).encode() for i in range(10, 100)), open('test', 'rb'))",  # requires file 'test' to exist
    repeat=R, number=N
)
results = [x/N for x in results]
print(['{:.3e}'.format(x) for x in results])
print(sum(results) / R)

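This gives the following results for Python 2 and Python 3:

Python 2: ~ 56.9 nanoseconds
Python 3: ~ 78.1 nanoseconds

This additional difference of about 21.2 nanoseconds amounts to about 12 seconds over the full 550M iterations.

The timing code above checks the dict lookup for only one key, so we also need to verify that there are no hash collisions: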

$ python3.8 -c "print(len({str(i).encode() for i in range(10, 100)}))"
90
$ python2.7 -c "print len({str(i).encode() for i in range(10, 100)})"
90

Wow. Really, really good question and answer. I don't know when I'll need this, but I'm sure I'll use it as a reference in the future. (2 upvotes)
