Is there an alternative to zip(*iterable) when the iterable contains millions of elements?

Question

Aso*_*cia 9 python optimization python-3.x iterable-unpacking

I came across code like this:

from random import randint

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

points = [Point(randint(1, 10), randint(1, 10)) for _ in range(10)]
xs = [point.x for point in points]
ys = [point.y for point in points]

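I don't think this code is Pythonic, because it repeats itself. If another dimension were added to the Point class, a whole new loop would have to be written, like this: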

zs = [point.z for point in points]

So I tried to make it more Pythonic by writing something like this:

xs, ys = zip(*[(point.x, point.y) for point in points])

If a new dimension is added, no problem:

xs, ys, zs = zip(*[(point.x, point.y, point.z) for point in points])

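But with millions of points, this is almost 10 times slower than the other solution, even though it has only one loop. I think that is because the * operator has to unpack millions of arguments into the zip function, which is horrible. So my question is:

Is there a way to change the code above so that it is as fast as before and still Pythonic (without using third-party libraries)?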

Answer 1

Try*_*yph 8

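I just tested several ways of zipping the Point coordinates and looked at how their performance changes as the number of points grows.

Here are the functions I used for the test: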

def hardcode(points):
    # a hand crafted comprehension for each coordinate
    return [point.x for point in points], [point.y for point in points]


def using_zip(points):
    # using the "problematic" zip function
    return zip(*((point.x, point.y) for point in points))


def loop_and_comprehension(points):
    # making comprehension from a list of coordinate names
    zipped = []
    for coordinate in ('x', 'y'):
        zipped.append([getattr(point, coordinate) for point in points])
    return zipped


def nested_comprehension(points):
    # making comprehension from a list of coordinate names using nested
    # comprehensions
    return [
        [getattr(point, coordinate) for point in points]
        for coordinate in ('x', 'y')
    ]

Using timeit, I timed each function with different numbers of points. Here are the results:

comparing processing times using 10 points and 10000000 iterations
hardcode................. 14.12024447 [+0%]
using_zip................ 16.84289724 [+19%]
loop_and_comprehension... 30.83631476 [+118%]
nested_comprehension..... 30.45758349 [+116%]

comparing processing times using 100 points and 1000000 iterations
hardcode................. 9.30594717 [+0%]
using_zip................ 13.74953714 [+48%]
loop_and_comprehension... 19.46766583 [+109%]
nested_comprehension..... 19.27818860 [+107%]

comparing processing times using 1000 points and 100000 iterations
hardcode................. 7.90372457 [+0%]
using_zip................ 12.51523594 [+58%]
loop_and_comprehension... 18.25679913 [+131%]
nested_comprehension..... 18.64352790 [+136%]

comparing processing times using 10000 points and 10000 iterations
hardcode................. 8.27348382 [+0%]
using_zip................ 18.23079485 [+120%]
loop_and_comprehension... 18.00183383 [+118%]
nested_comprehension..... 17.96230063 [+117%]

comparing processing times using 100000 points and 1000 iterations
hardcode................. 9.15848662 [+0%]
using_zip................ 22.70730675 [+148%]
loop_and_comprehension... 17.81126971 [+94%]
nested_comprehension..... 17.86892597 [+95%]

comparing processing times using 1000000 points and 100 iterations
hardcode................. 9.75002857 [+0%]
using_zip................ 23.13891725 [+137%]
loop_and_comprehension... 18.08724660 [+86%]
nested_comprehension..... 18.01269820 [+85%]

comparing processing times using 10000000 points and 10 iterations
hardcode................. 9.96045920 [+0%]
using_zip................ 23.11653558 [+132%]
loop_and_comprehension... 17.98296033 [+81%]
nested_comprehension..... 18.17317708 [+82%]

comparing processing times using 100000000 points and 1 iterations
hardcode................. 64.58698246 [+0%]
using_zip................ 92.53437881 [+43%]
loop_and_comprehension... 73.62493845 [+14%]
nested_comprehension..... 62.99444739 [-2%]
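A table like the one above could be produced with a small timeit harness. The following is only a sketch under assumptions, not the script used for these numbers: the names `compare` and `report` are hypothetical, and `using_zip` here wraps the result in `tuple(...)` because `zip` returns a lazy iterator in Python 3, so timing the bare call would not include building the output tuples.

```python
from random import randint
from timeit import timeit


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


def hardcode(points):
    # one hand-written comprehension per coordinate
    return [p.x for p in points], [p.y for p in points]


def using_zip(points):
    # tuple(...) forces the lazy zip object to build its output (assumption:
    # this differs from the version timed above, which returns the zip object)
    return tuple(zip(*((p.x, p.y) for p in points)))


def nested_comprehension(points):
    # one comprehension per coordinate name, driven by getattr
    return [[getattr(p, name) for p in points] for name in ("x", "y")]


def compare(functions, n_points, n_iterations):
    """Time each function on the same random points; return {name: seconds}."""
    points = [Point(randint(1, 10), randint(1, 10)) for _ in range(n_points)]
    timings = {}
    for func in functions:
        timings[func.__name__] = timeit(lambda: func(points), number=n_iterations)
    return timings


def report(timings):
    """Print each timing with its difference relative to the first entry."""
    baseline = next(iter(timings.values()))
    for name, seconds in timings.items():
        delta = (seconds - baseline) / baseline * 100
        print(f"{name:.<25} {seconds:.8f} [{delta:+.0f}%]")


if __name__ == "__main__":
    functions = [hardcode, using_zip, nested_comprehension]
    report(compare(functions, n_points=1000, n_iterations=1000))
```

Absolute numbers depend heavily on the machine and the Python version, so only the relative differences within one run are worth comparing.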

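We can see that, as the number of points increases, the gap between the "hardcode" solution and the solutions that build comprehensions with getattr seems to keep shrinking.

So for a large number of points, it may be a good idea to use a comprehension generated from a list of coordinates: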

[[getattr(point, coordinate) for point in points]
 for coordinate in ('x', 'y')]

However, for a small number of points this is the worst solution (at least among the ones I tested).
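To avoid repeating the coordinate names at every call site, the getattr-based comprehension can be wrapped in a helper. This is a minimal sketch; the name `unzip_attrs` is hypothetical, and it assumes the input can be iterated more than once (a list, not a one-shot generator), because it makes one pass per attribute.

```python
def unzip_attrs(objects, names):
    """Return one list per attribute name, in the order given by names."""
    return [[getattr(obj, name) for obj in objects] for name in names]


class Point:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


points = [Point(1, 2, 3), Point(4, 5, 6)]
xs, ys, zs = unzip_attrs(points, ("x", "y", "z"))
print(xs, ys, zs)  # [1, 4] [2, 5] [3, 6]
```

Adding a dimension then only means adding a name to the tuple.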

Answer 2

d.j*_*tta 6

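The problem with zip(*iter) is that it walks through the entire iterable and passes the resulting sequence as positional arguments to zip.

So these two are functionally the same:

Using *: xs, ys = zip(*[(x, y) for x, y in ((0,1),(0,2),(0,3))])

Using positional arguments: xs, ys = zip((0,1),(0,2),(0,3))

Obviously, with millions of positional arguments this will be slow.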
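The equivalence can be checked directly on a small input. In this sketch the variable names are only illustrative; the point is that star-unpacking turns every element of the sequence into one positional argument of the call.

```python
rows = [(0, 1), (0, 2), (0, 3)]

# Star-unpacking: each element of rows becomes one positional argument.
xs_star, ys_star = zip(*rows)

# The same call with the arguments written out by hand.
xs_pos, ys_pos = zip((0, 1), (0, 2), (0, 3))

print(xs_star, ys_star)  # (0, 0, 0) (1, 2, 3)
print((xs_star, ys_star) == (xs_pos, ys_pos))  # True
```

With a million rows, the call receives a million arguments, all of which have to be collected before zip can start.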

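An iterator-based approach is the only way around this.

I did a web search for python itertools unzip. Sadly, the closest thing itertools offers is tee. The following gist contains an implementation of iunzip that builds its result from the iterators returned by itertools.tee: https://gist.github.com/andrix/106334.

I had to convert it to Python 3: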

from random import randint
import itertools
import time
from operator import itemgetter

def iunzip(iterable):
    """iunzip is the same as zip(*iter) but returns iterators instead of
    expanding the iterable. Mostly useful for large sequences."""

    _tmp, iterable = itertools.tee(iterable, 2)
    iters = itertools.tee(iterable, len(next(_tmp)))
    return (map(itemgetter(i), it) for i, it in enumerate(iters))

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

points = [Point(randint(1, 10), randint(1, 10)) for _ in range(1000000)]
itime = time.time()
xs = [point.x for point in points]
ys = [point.y for point in points]
otime = time.time() - itime
itime += otime
print(f"original: {otime}")
xs, ys = zip(*[(p.x, p.y) for p in points])
otime = time.time() - itime
itime += otime
print(f"unpacking into zip: {otime}")
xs, ys = iunzip(((p.x, p.y) for p in points))
for _ in zip(xs, ys): pass
otime = time.time() - itime
itime += otime
print(f"iunzip: {otime}")

Output:

original: 0.1282501220703125
unpacking into zip: 1.286362886428833
iunzip: 0.3046858310699463

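So iterators are definitely better than unpacking into positional arguments. Not to mention that my 4 GB of RAM was eaten up when I went to 10 million points... However, I'm not convinced that the iunzip iterator above is as fast as it could be if it were a Python built-in, considering that iterating twice to unzip, as in the "original" approach, is still by far the fastest (about 4 times faster when trying different numbers of points).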

It seems like iunzip should be a thing. I'm surprised it isn't a Python built-in or part of itertools...
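One caveat when using a tee-based unzip: itertools.tee buffers every item that one output iterator has consumed and another has not, so the memory saving only holds when the columns are consumed roughly in lockstep. The sketch below repeats the iunzip idea on a small input (with one assumed change: it returns a tuple rather than a generator) and consumes both columns together, as the timing code above does with zip(xs, ys).

```python
import itertools
from operator import itemgetter


def iunzip(iterable):
    """Lazy counterpart of zip(*iterable): returns one iterator per column."""
    probe, iterable = itertools.tee(iterable, 2)
    width = len(next(probe))  # raises StopIteration on an empty input
    columns = itertools.tee(iterable, width)
    return tuple(map(itemgetter(i), col) for i, col in enumerate(columns))


pairs = ((n, n * n) for n in range(5))
xs, ys = iunzip(pairs)

# Consuming both columns in lockstep keeps tee's internal buffer small.
total_x = total_y = 0
for x, y in zip(xs, ys):
    total_x += x
    total_y += y
print(total_x, total_y)  # 10 30
```

Calling list(xs) before touching ys would instead force tee to hold the whole input in memory; in that situation, building plain lists is usually the simpler choice.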

Archived:	5 years, 10 months ago
Views:	293
Last activity:	5 years, 6 months ago