创建一个Python生成器,从两个大列表中生成有序的整数乘积

Question

创建一个Python生成器,从两个大列表中生成有序的整数乘积

M. *_*rer 4 python iterator generator python-3.x

所以,我有两个非常大的数字列表l1和l2.我想将每个元素的l1每一个元素相乘,l2 而不是明确地创建一个新的产品列表.因此,我想要一台发电机.这部分很容易.我可以做点什么

for a in l1:
    for b in l2:
        yield a * b

Run Code Online (Sandbox Code Playgroud)

但是,我还需要按照它们的大小来订购这些产品.我想知道是否有一些聪明的技巧来订购yield语句,这样也可以使用生成器来完成.在Python 3中,如果可能的话.谢谢.

Answer 1

Tim*_*ers 8

我会打电话给名单xs和ys,并承担他们排序.正如你在评论中指出的那样,最小的产品是必然的xs[0] * ys[0]- 但只有你还假设所有数字都是非负的,所以我也会假设.

在第一个产品之后,它变得更加混乱 - 否则你已经解决了它;-)接下来要考虑的是xs[0] * ys[1]和xs[1] * ys[0].很容易,但接下来考虑取决于哪些赢了.如果xs[0] * ys[1]赢了,你只需要替换它xs[0] * ys[2],但如果xs[1] * ys[0]赢了,那么两个xs[1] * ys[1]并且xs[2] * ys[0]发挥作用.等等.

以下内容通过堆跟踪不断增长的可能性.堆永远不会包含多个len(xs)项,因此代码首先安排制作xs更短的列表:

def upprod(xs, ys):
    # xs and ys must be sorted, and non-negative
    from heapq import heappush, heappop
    # make xs the shorter
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    lenxs = len(xs)
    lenys = len(ys)
    # the heap holds 4-tuples:
    #     (product, xs index, ys index, xs[xs index])
    h = [(xs[0] * ys[0], 0, 0, xs[0])]
    while h:
        prod, xi, yi, x = heappop(h)
        yield prod
        # same x with next y
        yi += 1
        if yi < lenys:
            heappush(h, (x * ys[yi], xi, yi, x))
        # if this is the first time we used x, start
        # the next x going
        if yi == 1:
            xi += 1
            if xi < lenxs:
                x = xs[xi]
                heappush(h, (x * ys[0], xi, 0, x))

Run Code Online (Sandbox Code Playgroud)

如果存在一种本质上更有效的解决方案,我会感到惊喜.如果有人认为他们有一个,请先使用这个随机测试器尝试:

from itertools import product
from random import randrange
MAXLEN = 10
UB = 1000
ntest = 0
while True:
    ntest += 1
    lenxs = randrange(MAXLEN + 1)
    lenys = randrange(MAXLEN + 1)
    xs = sorted(randrange(UB) for i in range(lenxs))
    ys = sorted(randrange(UB) for i in range(lenys))
    brute = sorted(a*b for a, b in product(xs, ys))
    got = list(upprod(xs, ys))
    if brute != got:
        print("OUCH!")
        print(xs)
        print(ys)
        print(brute)
        print(got)
        assert False
    if ntest % 10_000 == 0:
        print(f"finished test {ntest:,}")

Run Code Online (Sandbox Code Playgroud)

编辑 - 在某些意义上理论上更好;-)

上面没有充分利用我们可以单独从索引中推导出的偏序:if

i1 <= i2 and j1 <= j2

Run Code Online (Sandbox Code Playgroud)

然后我们知道

xs[i1] * ys[j1] <= xs[i2] * ys[j2]

Run Code Online (Sandbox Code Playgroud)

因为排序意味着xs[i1] <= xs[i2]和ys[j1] <= ys[j2].

因此,例如,如果索引对(0, 1)并且(1, 0)在堆上,并且第二个获胜,则(2, 0)需要添加到堆中但(1, 1)不是:仅从索引中,我们知道(0, 1)堆中剩余的产品是不大于产品(1, 1).只有当需要(0, 1)删除时才(1, 1)需要添加.

一般来说,每一对形式(i, 0)都有一个直接的前任(i-1, 0),而(0, j)单个(0, j-1),所有其他(i, j)都有两个直接的前辈: (i-1, j)和(i, j-1).除非所有的前辈都已从堆中取出,否则无需在堆上放置一对.

这导致了这个代码,它看似"更优雅",因为更加对称:

def upprod(xs, ys):
    # xs and ys must be sorted, and non-negative
    from heapq import heappush, heappop
    # make xs the shorter
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    lenxs = len(xs)
    lenys = len(ys)
    # the heap holds 3-tuples:
    #     (product, xs index, ys index)
    h = [(xs[0] * ys[0], 0, 0)]

    # interior points for which only one immediate predecessor has
    # been processed; there's no need to put them in the heap
    # until their second predecessor has been processed too
    pending = set()

    def add(xi, yi):
        if xi < lenxs and yi < lenys:
            if xi and yi: # if either is 0, only one predecessor
                p = xi, yi
                if p in pending:
                    pending.remove(p)
                else:
                    pending.add(p)
                    return
            heappush(h, (xs[xi] * ys[yi], xi, yi))

    while h:
        prod, xi, yi = heappop(h)
        yield prod
        # same x with next y; and same y with next x
        add(xi, yi + 1)
        add(xi + 1, yi)
    assert not pending

Run Code Online (Sandbox Code Playgroud)

与第一个代码相比,它在许多情况下使堆保持较小.但堆操作需要时间对数的堆条目,并且堆仍然可以增长到len(xs)条目,所以这不是一个胜利.它可能会丢失两个新函数调用的开销(虽然内联那些太难看了).

Answer 2

PM *_*ing 5

我的解决方案是创建一个生成器列表，乘积矩阵中的每一行一个生成器，然后用于heapq.merge对这些生成器的输出进行排序。在 32 位机器上，每个生成器的大小为 44 字节，因此整个生成器列表仅消耗适量的 RAM。

heapq.merge（当没有提供排序键函数时）通过为传递给它的每个可迭代对象创建一个三元组来工作。该元组包含可迭代对象的下一个值、可迭代对象的索引号以及对可迭代对象__next__方法的引用。它将这些元组放置到堆上以执行可迭代值的合并排序。您可以在其Python源代码中查看详细信息。

因此，我的方法并不像蒂姆·彼得斯的解决方案那么节俭，但也不算太破烂，恕我直言。;)

def sorted_prod_merge(xs, ys):
    ''' mergesort generators of the rows. '''
    if len(ys) < len(xs):
        xs, ys = ys, xs
    def gen(x):
        for y in ys:
            yield x * y
    yield from merge(*[gen(x) for x in xs])

Run Code Online (Sandbox Code Playgroud)

这里有一些代码显示了Tim Peters 的解决方案以及我的一些其他版本的解决方案timeit的运行时间。sorted_prod_merge我使用了 Tim 的变量名来保持代码的统一。有趣的是，蒂姆的第一个版本的速度大约是他更高级的解决方案的两倍。我的sorted_prod_row运行速度很快，但是内存消耗太严重了。

该代码使用配方timeit中给出的技术来耗尽迭代器：我们将其提供给零长度双端队列。该代码对每次运行的 3 个结果进行排序。这是因为最小结果是重要的，其他值仅指示测试运行时系统的差异。有关详细信息，请参阅文档中的注释。itertoolstime_testTimerTimer.repeat

from heapq import heappush, heappop, merge from random import seed, randrange from timeit import Timer from collections import deque seed(163) # Brute force method, as a generator def sorted_prod_brute(xs, ys): yield from sorted(x * y for x in xs for y in ys) # By Tim Peters def upprod1(xs, ys): # xs and ys must be sorted, and non-negative from heapq import heappush, heappop # make xs the shorter if len(ys) < len(xs): xs, ys = ys, xs if not xs: return lenxs = len(xs) lenys = len(ys) # the heap holds 4-tuples: # (product, xs index, ys index, xs[xs index]) h = [(xs[0] * ys[0], 0, 0, xs[0])] while h: prod, xi, yi, x = heappop(h) yield prod # same x with next y yi += 1 if yi < lenys: heappush(h, (x * ys[yi], xi, yi, x)) # if this is the first time we used x, start # the next x going if yi == 1: xi += 1 if xi < lenxs: x = xs[xi] heappush(h, (x * ys[0], xi, 0, x)) # By Tim Peters def upprod2(xs, ys): # xs and ys must be sorted, and non-negative from heapq import heappush, heappop # make xs the shorter if len(ys) < len(xs): xs, ys = ys, xs if not xs: return lenxs = len(xs) lenys = len(ys) # the heap holds 3-tuples: # (product, xs index, ys index) h = [(xs[0] * ys[0], 0, 0)] # interior points for which only one immediate predecessor has # been processed; there's no need to put them in the heap # until their second predecessor has been processed too pending = set() def add(xi, yi): if xi < lenxs and yi < lenys: doit = True if xi and yi: # if either is 0, only one predecessor p = xi, yi if p in pending: pending.remove(p) else: pending.add(p) doit = False if doit: heappush(h, (xs[xi] * ys[yi], xi, yi)) while h: prod, xi, yi = heappop(h) yield prod # same x with next y; and same y with next x add(xi, yi + 1) add(xi + 1, yi) assert not pending def sorted_prod_merge(xs, ys): ''' mergesort generators of the rows. ''' if len(ys) < len(xs): xs, ys = ys, xs def gen(x): for y in ys: yield x * y yield from merge(*[gen(x) for x in xs]) def sorted_prod_row(xs, ys): ''' Heapsort, row by row. Fast, but not space-efficient: the maximum heap size grows to almost len(ys) * len(xs) ''' if len(ys) < len(xs): xs, ys = ys, xs if not xs: return x, xs = xs[0], xs[1:] heap = [] #big = 0 for y in ys: lo = x * y while heap and heap[0] <= lo: yield heappop(heap) yield lo for u in xs: heappush(heap, u * y) #big = max(big, len(heap)) #print(big) while heap: yield heappop(heap) def sorted_prod_diag(xs, ys): ''' Heapsort, going along the diagonals 50% slower than sorted_prod_row, but more space-efficient: the maximum heap size grows to around 0.5 * len(ys) * len(xs) ''' if not (xs and ys): return lenxs, lenys = len(xs), len(ys) heap = [] #big = 0 for n in range(lenxs + lenys - 1): row = sorted(xs[n - i] * ys[i] for i in range(max(0, n + 1 - lenxs), min(lenys, n + 1))) lo = row[0] while heap and heap[0] <= lo: yield heappop(heap) yield lo for u in row[1:]: heappush(heap, u) #big = max(big, len(heap)) #print(big) #assert not heap def sorted_prod_block(xs, ys): ''' yield the top left corner, then merge sort the top row, the left column and the remaining block. So we end up with max(len(xs), len(ys)) recursively nested calls to merge(). It's ok for small lists, but too slow otherwise. ''' if not (xs and ys): return x, *xs = xs y, *ys = ys yield x * y row = (y * u for u in xs) col = (x * v for v in ys) yield from merge(row, col, sorted_prod_block(xs, ys)) def sorted_prod_blockI(xs, ys): ''' Similar to sorted_prod_block except we use indexing to avoid creating sliced copies of the lists ''' lenxs, lenys = len(xs), len(ys) def sorted_block(xi, yi): if xi == lenxs or yi == lenys: return x, y = xs[xi], ys[yi] yield x * y xi, yi = xi + 1, yi + 1 row = (xs[i] * y for i in range(xi, lenxs)) col = (ys[i] * x for i in range(yi, lenys)) yield from merge(row, col, sorted_block(xi, yi)) yield from sorted_block(0, 0) functions = ( sorted_prod_brute, upprod1, upprod2, sorted_prod_merge, #sorted_prod_row, sorted_prod_diag, #sorted_prod_block, #sorted_prod_blockI, ) UB = 1000 def verify(numtests, maxlen=10): print('Verifying. maxlen =', maxlen) for k in range(numtests): lenxs = randrange(maxlen + 1) lenys = randrange(maxlen + 1) print(k, ':', lenxs, '*', lenys, '=', lenxs * lenys) xs = sorted(randrange(UB) for i in range(lenxs)) ys = sorted(randrange(UB) for i in range(lenys)) good = list(sorted_prod_brute(xs, ys)) for func in functions[1:]: result = list(func(xs, ys)) if result != good: print(func.__name__, 'failed!') print() def time_test(loops=20): timings = [] for func in functions: # Consume the generator output by feeding it to a zero-length deque t = Timer(lambda: deque(func(xs, ys), maxlen=0)) result = sorted(t.repeat(3, loops)) timings.append((result, func.__name__)) timings.sort() for result, name in timings: print('{:18} : {:.6f}, {:.6f}, {:.6f}'.format(name, *result)) print() verify(10, 10) verify(20, 100) print('\nTimings') loops = 8192 minlen = 5 for k in range(6): lenxs = randrange(minlen, 2 * minlen) lenys = randrange(minlen, 2 * minlen) print(k, ':', loops, 'loops.', lenxs, '*', lenys, '=', lenxs * lenys) xs = sorted(randrange(UB) for i in range(lenxs)) ys = sorted(randrange(UB) for i in range(lenys)) time_test(loops) minlen *= 2 loops //= 4
Run Code Online (Sandbox Code Playgroud)
这是我古老的 2GHz 32 位单核机器上的输出，在旧的 Linux Debian 衍生发行版上运行 Python 3.6.0。YMMV。

def sorted_prod_merge(xs, ys): ''' mergesort generators of the rows. ''' if len(ys) < len(xs): xs, ys = ys, xs def gen(x): for y in ys: yield x * y yield from merge(*[gen(x) for x in xs])
Run Code Online (Sandbox Code Playgroud)

Answer 3

ab1*_*123 -2

似乎没有任何其他方法可以在不创建列表的情况下对这些输出进行排序，因为输出无法在不存储的情况下进行排序。以下是您可以如何做到这一点。

myList = []

for i in range(len(l1)):
    for j in range(len(l2)):
        output = l1[i] * l2[j]
        myList.append(output)
myList.sort()
print(myList)

Run Code Online (Sandbox Code Playgroud)

希望有帮助。

编辑往往不愿意更改问题或答案中的代码，因为担心这可能会改变帖子的意图。原作者自己修复它通常会更好:-) (2认同)

归档时间：	7 年，5 月前
查看次数：	207 次
最近记录：	7 年，5 月前