Picking unordered combinations from pools with overlap

qwr 25 python combinations combinatorics

I have pools of values, and I want to generate every possible unordered combination by picking from certain pools.

For example, say I want to pick from pool 0, pool 0, and pool 1:

>>> from itertools import product
>>> pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
>>> part = (0, 0, 1)
>>> list(product(*(pools[i] for i in part)))
[(1, 1, 2), (1, 1, 3), (1, 1, 4), (1, 2, 2), (1, 2, 3), (1, 2, 4), (1, 3, 2), (1, 3, 3), (1, 3, 4), (2, 1, 2), (2, 1, 3), (2, 1, 4), (2, 2, 2), (2, 2, 3), (2, 2, 4), (2, 3, 2), (2, 3, 3), (2, 3, 4), (3, 1, 2), (3, 1, 3), (3, 1, 4), (3, 2, 2), (3, 2, 3), (3, 2, 4), (3, 3, 2), (3, 3, 3), (3, 3, 4)]

This generates every possible combination by picking from pool 0, pool 0, and pool 1.

But the order doesn't matter to me, so many of the combinations are really duplicates. For example, since I used a Cartesian product, both (1, 2, 4) and (2, 1, 4) are generated.

I came up with a simple way to mitigate this issue: for members picked from a single pool, I select without ordering via combinations_with_replacement. I count how many times I want to draw from each pool. The code looks like this:

from collections import Counter
from itertools import chain, combinations_with_replacement, product

def pick(pools, part):
    cnt = Counter()
    for ind in part: cnt[ind] += 1
    blocks = [combinations_with_replacement(pools[i], cnt[i]) for i in cnt]
    return [list(chain(*combo)) for combo in product(*blocks)]

This reduces ordering duplicates if I happen to pick from the same pool multiple times. However, all the pools overlap heavily, and using combinations_with_replacement on several pools merged together would generate some invalid combinations. Is there a more efficient way to generate unordered combinations?

Edit: extra information about the input: the number of parts and pools is small (~5 and ~20), and, for simplicity, each element is an integer. I have already solved the actual problem, so this is just of academic interest. Let's say each pool holds thousands of integers, but some pools are small, with only dozens. So some kind of union or intersection seems to be the way to go.

Tim Peters 11

This isn't "an answer" so much as a spur to think harder ;-) For concreteness, I'll wrap the OP's code, slightly respelled, in a function that also weeds out duplicates:

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import chain, product
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    seen = set()
    cnt = Counter(ixs) # map index to count
    blocks = [cwr(pools[i], count) for i, count in cnt.items()]
    for t in product(*blocks):
        t = tuple(sorted(chain(*t)))
        if t not in seen:
            seen.add(t)
            yield t

I'm not worried about the sorting here - it's memory-frugal, and for small tuples is probably faster than all the overheads involved in creating a Counter object.

But regardless, the point here is to emphasize the real value the OP already got by recasting the problem to use combinations_with_replacement (cwr). Consider these inputs:

N = 64
pools = [[0, 1]]
ixs = [0] * N

There are only 65 unique results, and the function generates them instantaneously, with no internal duplicates at all. On the other hand, the essentially identical

pools = [[0, 1]] * N
ixs = range(N)

also has the same 65 unique results, but essentially runs forever (as would, e.g., the other answers given so far), slogging through 2**64 possibilities. cwr doesn't help here because each pool index appears only once.
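
Assuming the gen() above is in scope, the first spelling's count is quick to verify:

>>> sum(1 for _ in gen([[0, 1]], [0] * 64))
65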

So there's astronomical room for improvement over any solution that "merely" weeds out duplicates from a full Cartesian product, and some of it can be won by doing what the OP already did.

It seems to me the most promising approach is to write a custom generator (not one relying primarily on itertools functions) that yields all possibilities in lexicographic order to begin with (so, by construction, no duplicates are created in the first place). But that requires some "global" analysis of the input pools first, and the code I started on quickly got more complex than I can make time for now.

One based on @user2357112's answer

Combining cwr with @user2357112's incremental de-duplication gives a brief algorithm that runs fast on all the test cases I have. For example, it's essentially instantaneous for both spellings of the [0, 1] ** 64 examples above, and runs the example at the end of @JosephWood's answer about as fast as he said his C++ code ran (0.35 seconds on my box under Python 3.7.0, and, yes, finding 162295 results):

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    result = {()}
    for i, count in Counter(ixs).items():
        result = {tuple(sorted(old + new))
                  for new in cwr(pools[i], count)
                  for old in result}
    return result

To make it easier for other Pythonistas to try the last example, here are the inputs as executable Python:

pools = [[1, 10, 14, 6],
         [7, 2, 4, 8, 3, 11, 12],
         [11, 3, 13, 4, 15, 8, 6, 5],
         [10, 1, 3, 2, 9, 5,  7],
         [1, 5, 10, 3, 8, 14],
         [15, 3, 7, 10, 4, 5, 8, 6],
         [14, 9, 11, 15],
         [7, 6, 13, 14, 10, 11, 9, 4]]
ixs = range(len(pools))

However, the OP later added that they typically have about 20 pools, each with thousands of elements. 1000**20 = 1e60 is way out of practical reach for any approach that builds the full Cartesian product, no matter how cleverly it weeds out duplicates. It's clear as mud how many of those are likely to be duplicates, though, and so also clear as mud whether this "incremental de-duplication" is good enough to be practical.

Ideally we'd have a generator that yields one result at a time, in lexicographic order.

Lazy lexicographic one-at-a-time generation

Building on the incremental de-duplication: suppose we have a strictly increasing (lexicographic) sequence of sorted tuples, append the same tuple T to each of them, and sort each result again. Then the derived sequence is still strictly increasing. For example, in the left column we have the 10 unique pairs from range(4), and in the right column what happens after we append (and re-sort) 2 to each:

00 002
01 012
02 022
03 023
11 112
12 122
13 123
22 222
23 223
33 233

They start in sorted order, and the derived triples are also in sorted order. I'll skip the easy proof (sketch: if t1 and t2 are adjacent tuples with t1 < t2, let i be the smallest index such that t1[i] != t2[i]. Then t1[i] < t2[i] (that's what "lexicographic" means). Then if you throw x into both tuples, proceed by cases: is x <= t1[i]? between t1[i] and t2[i]? is x >= t2[i]? In each case it's easy to see that the first derived tuple remains strictly less than the second derived tuple.)
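
A quick mechanical check of that claim, using only itertools:

from itertools import combinations_with_replacement as cwr

pairs = list(cwr(range(4), 2))                      # strictly increasing
triples = [tuple(sorted(p + (2,))) for p in pairs]  # append 2, re-sort each
assert triples == sorted(set(triples))              # still strictly increasing, no dups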

So, supposing we have a sorted sequence result of all unique sorted tuples drawn from some of the pools, what happens when we add the elements of a new pool P into the tuples? Well, as above,

[tuple(sorted(old + (P[0],))) for old in result]

is also in sorted order, and so is

[tuple(sorted(old + (P[i],))) for old in result]

for all i in range(len(P)). These guaranteed-already-sorted sequences can be merged via heapq.merge(), with another generator (killdups() below) running over the merge result to weed out duplicates on the fly. There's no need, e.g., to keep a set of all tuples seen so far: because the merge's output is non-decreasing, it suffices to check whether the next result is the same as the last result output.

Getting this to work lazily is delicate. The entire result-so-far sequence has to be accessed by each element of the new pool being added, but we don't want to materialize the whole thing in one gulp. Instead itertools.tee() allows each element of the next pool to traverse the result-so-far sequence at its own pace, and automatically frees memory for each result item once all the new pool elements have finished with it.

The function build1() (or some workalike) is needed to ensure the right values are accessed at the right times. For example, if the body of build1() is pasted inline where it's called, the code fails spectacularly (the body would access the final values bound to rcopy and new, rather than what they were bound to at the time the generator expression was created).
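
That's Python's late binding of free variables in generator expressions at work. A minimal illustration of the pitfall (toy names of my own, not part of the code below):

gens = [(x + i for x in range(3)) for i in range(3)]
# Each genexp looks i up only when iterated - by then i is 2:
print([list(g) for g in gens])   # [[2, 3, 4], [2, 3, 4], [2, 3, 4]]

def build(i):
    # Passing i as an argument captures its value at creation time.
    return (x + i for x in range(3))

gens = [build(i) for i in range(3)]
print([list(g) for g in gens])   # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]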

In all, of course this is somewhat slower, due to layers of delayed generator calls and heap merges. In return, it returns the results in lexicographic order, can start delivering results very quickly, and has lower peak memory burden if for no other reason than that the final result sequence isn't materialized at all (little is done until the caller iterates over the returned generator).

Tech note: don't fear sorted() here. The appending is done via old + new for a reason: old is already sorted, and new is typically a 1-tuple. Python's sort is linear-time in this case, not O(N log N).

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import tee
    from collections import Counter
    from heapq import merge

    def killdups(xs):
        last = None
        for x in xs:
            if x != last:
                yield x
                last = x

    def build1(rcopy, new):
        return (tuple(sorted(old + new)) for old in rcopy)

    assert all(0 <= i < len(pools) for i in ixs)
    result = [()]
    for i, count in Counter(ixs).items():
        poolelts = list(cwr(pools[i], count))
        xs = [build1(rcopy, new)
              for rcopy, new in zip(tee(result, len(poolelts)),
                                    poolelts)]
        result = killdups(merge(*xs))
    return result

2 inputs

Turns out that for the 2-input case there's an easy approach derived from set algebra. If x and y are the same, cwr(x, 2) is the answer. If x and y are disjoint, product(x, y). Else the intersection c of x and y is non-empty, and the answer is the catenation of 4 cross-products obtained from the 3 pairwise-disjoint sets c, x-c, and y-c: cwr(c, 2), product(x-c, c), product(y-c, c), and product(x-c, y-c). Proof is straightforward but tedious so I'll skip it. For example, there are no duplicates between cwr(c, 2) and product(x-c, c) because every tuple in the latter contains an element from x-c, but every tuple in the former contains elements only from c, and x-c and c are disjoint by construction. There are no duplicates within product(x-c, y-c) because the two inputs are disjoint (if they contained an element in common, that would have been in the intersection of x and y, contradicting that x-c has no element in the intersection). Etc.

Alas, I haven't found a way to generalize this beyond 2 inputs, which surprised me. It can be used on its own, or as a building block in other approaches. For example, if there are many inputs, they can be searched for pairs with large intersections, and this 2-input scheme used to do those parts of the overall products directly.

Even at just 3 inputs, it's not clear to me how to get the right result for

[1, 2], [2, 3], [1, 3]

The full Cartesian product has 2**3 = 8 elements, only one of which repeats: (1, 2, 3) appears twice (as (1, 2, 3) and again as (2, 3, 1)). Each pair of inputs has a 1-element intersection, but the intersection of all 3 is empty.
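
To see that concretely (a quick check):

from collections import Counter
from itertools import product

prods = [tuple(sorted(t)) for t in product([1, 2], [2, 3], [1, 3])]
print(Counter(prods))  # (1, 2, 3) has count 2; the other 6 tuples have count 1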

Here's an implementation of the 2-input scheme:

def pair(x, y):
    from itertools import product, chain
    from itertools import combinations_with_replacement
    x = set(x)
    y = set(y)
    c = x & y
    chunks = []
    if c:
        x -= c
        y -= c
        chunks.append(combinations_with_replacement(c, 2))
        if x:
            chunks.append(product(x, c))
        if y:
            chunks.append(product(y, c))
    if x and y:
        chunks.append(product(x, y))
    return chain.from_iterable(chunks)
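For example, normalizing each output pair to sorted order for display:

>>> pools = [[1, 2, 3], [2, 3, 4]]
>>> sorted(tuple(sorted(t)) for t in pair(*pools))
[(1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4)]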

A Proof-of-Concept Based on Maximal Matching

This blends ideas from @Leon's sketch and an approach @JosephWood sketched in comments. It's not polished, and can obviously be sped up, but it's reasonably quick on all the cases I tried. Because it's rather complex, it's probably more useful to post it in an already-hard-enough-to-follow un-optimized form anyway!

This doesn't make any attempt to determine the set of "free" pools (as in @Leon's sketch). Primarily because I didn't have code sitting around for that, and partly because it wasn't immediately clear how to accomplish that efficiently. I did have code sitting around to find a match in a bipartite graph, which required only a few changes to use in this context.

So this tries plausible result prefixes in lexicographic order, as in @JosephWood's sketch, and for each sees whether it's actually possible to construct via checking whether a bipartite-graph match exists.

So while the details of @Leon's sketch are largely unimplemented here, the visible behaviors are much the same: it produces results in lexicographic order, it doesn't need to check for duplicates, it's a lazy generator, peak memory use is proportional to the sum of the lengths of the pools, it can obviously be parallelized in many ways (set different processes to work on different regions of the result space), and the key to making it majorly faster lies in reducing the massive amounts of redundant work done by the graph-matching function (a great deal of what it does on each call merely reproduces what it did on the previous call).

def matchgen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import chain, repeat, islice

    elt2pools = defaultdict(set)
    npools = 0
    for i, count in Counter(ixs).items():
        set_indices = set(range(npools, npools + count))
        for elt in pools[i]:
            elt2pools[elt] |= set_indices
        npools += count
    elt2count = {elt : len(ps) for elt, ps in elt2pools.items()}

    cands = sorted(elt2pools.keys())
    ncands = len(cands)

    result = [None] * npools

    # Is it possible to match result[:n] + [elt]*count?
    # We already know it's possible to match result[:n], but
    # this code doesn't exploit that.
    def match(n, elt, count):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y in y2x:
                        if extend(y2x[y], seen):
                            y2x[y] = x
                            return True
                    else:
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in chain(islice(result, n), repeat(elt, count)):
            for y in elt2pools[x]:
                if y not in y2x:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)
        # Now do real work.
        seen = set()
        for x in freexs:
            seen.clear()
            if not extend(x, seen):
                return False
        return True

    def inner(i, j): # fill result[j:] with elts from cands[i:]
        if j >= npools:
            yield tuple(result)
            return
        for i in range(i, ncands):
            elt = cands[i]
            # Find the most times `elt` can be added.
            count = min(elt2count[elt], npools - j)
            while count:
                if match(j, elt, count):
                    break
                count -= 1
            # Since it can be added `count` times, it can also
            # be added any number of times less than `count`.
            for k in range(count):
                result[j + k] = elt
            while count:
                yield from inner(i + 1, j + count)
                count -= 1

    return inner(0, 0)

EDIT: note that there's a potential trap here, illustrated by the pair of pools range(10_000) and range(100_000). After producing (9999, 99999), the first position increments to 10000, and then it continues for a very long time deducing that there's no match for any of the possibilities in 10001 .. 99999 in the second position; and then for 10001 in the first position no match for any of the possibilities in 10002 .. 99999 in the second position; and so on. @Leon's scheme instead would have noted that range(10_000) was the only free pool remaining having picked 10000 in the first position, and noted at once then that range(10_000) doesn't contain any values greater than 10000. It would apparently need to do that again for 10001, 10002, ..., 99999 in the first position. That's a linear-time rather than quadratic-time waste of cycles, but a waste all the same. Moral of the story: don't trust anything until you have actual code to try ;-)

And One Based on @Leon's Scheme

Following is a more-than-less faithful implementation of @Leon's ideas. I like the code better than my "proof of concept" (POC) code just above, but was surprised to find that the new code runs significantly slower (a factor of 3 to 4 times slower on a variety of cases akin to @JosephWood's randomized example) relative to a comparably "optimized" variant of the POC code.

The primary reason appears to be more calls to the matching function. The POC code called that once per "plausible" prefix. The new code doesn't generate any impossible prefixes, but for each prefix it generates may need to make multiple match() calls to determine the possibly smaller set of free pools remaining. Perhaps there's a cleverer way to do that.

Note that I added one twist: if a free pool's elements are all smaller than the last element of the prefix so far, it remains "a free pool" with respect to the prefix, but it's useless because none of its elements can appear in the candidates. This doesn't matter to the outcome, but it means the pool remains in the set of free pools for all remaining recursive calls, which in turn means they can waste time determining that it's still a "free pool". So when a free pool can no longer be used for anything, this version removes it from the set of free pools. This gave a significant speedup.

Note: there are many ways to try matching, some of which have better theoretical O() worst-case behavior. In my experience, simple depth-first (as here) search runs faster in real life on typical cases. But it depends very much on characteristics of what "typical" graphs look like in the application at hand. I haven't tried other ways here.

Bottom lines, ignoring the "2 inputs" special-case code:

  • Nothing here beats incremental de-duplication for speed, if you have the RAM. But nothing is worse than that for peak memory burden.

  • Nothing beats the matching-based approaches for frugal memory burden. They're in an entirely different universe on that measure. They're also the slowest, although at least in the same universe ;-)

The code:

def matchgen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import islice

    elt2pools = defaultdict(list)
    allpools = []
    npools = 0
    for i, count in Counter(ixs).items():
        indices = list(range(npools, npools + count))
        plist = sorted(pools[i])
        for elt in plist:
            elt2pools[elt].extend(indices)
        for i in range(count):
            allpools.append(plist)
        npools += count
    pools = allpools
    assert npools == len(pools)

    result = [None] * npools

    # Is it possible to match result[:n] not using pool
    # bady?  If not, return None.  Else return a matching,
    # a dict whose keys are pool indices and whose values
    # are a permutation of result[:n].
    def match(n, bady):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y not in y2x or extend(y2x[y], seen):
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in islice(result, n):
            for y in elt2pools[x]:
                if y not in y2x and y != bady:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)

        # Now do real work.
        for x in freexs:
            if not extend(x, {bady}):
                return None
        return y2x

    def inner(j, freepools): # fill result[j:]
        from bisect import bisect_left
        if j >= npools:
            yield tuple(result)
            return
        if j:
            new_freepools = set()
            allcands = set()
            exhausted = set()  # free pools with elts too small
            atleast = result[j-1]
            for pi in freepools:
                if pi not in new_freepools:
                    m = match(j, pi)
                    if not m:  # match must use pi
                        continue
                    # Since `m` is a match to result[:j],
                    # any pool in freepools it does _not_
                    # use must still be free.
                    new_freepools |= freepools - m.keys()
                    assert pi in new_freepools
                # pi is free with respect to result[:j].
                pool = pools[pi]
                if pool[-1] < atleast:
                    exhausted.add(pi)
                else:
                    i = bisect_left(pool, atleast)
                    allcands.update(pool[i:])
            if exhausted:
                freepools -= exhausted
                new_freepools -= exhausted
        else: # j == 0
            new_freepools = freepools
            allcands = elt2pools.keys()

        for result[j] in sorted(allcands):
            yield from inner(j + 1, new_freepools)

    return inner(0, set(range(npools)))

Note: this has its own classes of "bad cases". For example, passing 128 copies of [0, 1] consumes about 2 minutes(!) of time on my box to find the 129 results. The POC code takes under a second, while some of the non-matching approaches appear instantaneous.

I won't go into detail about why. Suffice it to say that because all the pools are the same, they all remain "free pools" no matter how deep the recursion goes. match() never has a hard time, always finding a complete match for the prefix in its first (greedy) pass, but even that takes time proportional to the square of the current prefix length (== the current recursion depth).

Pragmatic hybrid

One more here. As noted before, the matching-based approaches suffer some from the expense of graph matching as a fundamental operation repeated so often, and have some unfortunate bad cases pretty easy to stumble into.

Highly similar pools cause the set of free pools to shrink slowly (or not at all). But in that case the pools are so similar that it rarely matters which pool an element is taken from. So the approach below doesn't try to keep exact track of free pools, picks arbitrary pools so long as such are obviously available, and resorts to graph-matching only when it gets stuck. That seems to work well. As an extreme example, the 129 results from 128 [0, 1] pools are delivered in less than a tenth of second instead of in two minutes. It turns out it never needs to do graph-matching in that case.

Another problem with the POC code (and less so for the other match-based approach) was the possibility of spinning wheels for a long time after the last result was delivered. A pragmatic hack solves that one completely ;-) The last tuple of the sequence is easily computed in advance, and code raises an internal exception to end everything immediately after the last tuple is delivered.

That's it for me! A generalization of the "two inputs" case would remain very interesting to me, but all the itches I got from the other approaches have been scratched now.

def combogen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import islice

    elt2pools = defaultdict(set)
    npools = 0
    cands = []
    MAXTUPLE = []
    for i, count in Counter(ixs).items():
        indices = set(range(npools, npools + count))
        huge = None
        for elt in pools[i]:
            elt2pools[elt] |= indices
            for i in range(count):
                cands.append(elt)
            if huge is None or elt > huge:
                huge = elt
        MAXTUPLE.extend([huge] * count)
        npools += count
    MAXTUPLE = tuple(sorted(MAXTUPLE))
    cands.sort()
    ncands = len(cands)
    ALLPOOLS = set(range(npools))
    availpools = ALLPOOLS.copy()
    result = [None] * npools

    class Finished(Exception):
        pass

    # Is it possible to match result[:n]? If not, return None.  Else
    # return a matching, a dict whose keys are pool indices and
    # whose values are a permutation of result[:n].
    def match(n):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y not in y2x or extend(y2x[y], seen):
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in islice(result, n):
            for y in elt2pools[x]:
                if y not in y2x:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)

        # Now do real work.
        seen = set()
        for x in freexs:
            seen.clear()
            if not extend(x, seen):
                return None
        return y2x

    def inner(i, j):  # fill result[j:] with cands[i:]
        nonlocal availpools
        if j >= npools:
            r = tuple(result)
            yield r
            if r == MAXTUPLE:
                raise Finished
            return
        restore_availpools = None
        last = None
        jp1 = j + 1
        for i in range(i, ncands):
            elt = cands[i]
            if elt == last:
                continue
            last = result[j] = elt
            pools = elt2pools[elt] & availpools
            if pools:
                pool = pools.pop() # pick one - arbitrary
                availpools.remove(pool)
            else:
                # Find _a_ matching, and if that's possible fiddle
                # availpools to pretend that's the one we used all
                # along.
                m = match(jp1)
                if not m: # the prefix can't be extended with elt
                    continue
                if restore_availpools is None:
                    restore_availpools = availpools.copy()
                availpools = ALLPOOLS - m.keys()
                # Find a pool from which elt was taken.
                for pool, v in m.items():
                    if v == elt:
                        break
                else:
                    assert False
            yield from inner(i+1, jp1)
            availpools.add(pool)

        if restore_availpools is not None:
            availpools = restore_availpools

    try:
        yield from inner(0, 0)
    except Finished:
        pass

  • @JosephWood, I appreciate the appreciation ;-) It's an irony of StackOverflow that the questions I work hardest on get the fewest upvotes ;-) I had tried the scheme you sketched, but never got it to work completely. A key problem is illustrated by `[2, 3]` and `[2, 4]`: to get `(2, 3)` you need to pick the 2 from the second pool, but to get `(2, 4)` you need to pick the 2 from the first pool. So no scheme that traverses the possibilities strictly "left to right" can work. I haven't yet found an approach that always works. I've been waiting for you :-) (2 upvotes)

Joseph Wood 8

This is a tough problem. I think your best bet in the general case is to implement a hash table where the key is a multiset and the value is your actual combination. This is similar to what @ErikWolf mentioned, but this method avoids producing duplicates in the first place, so no filtering is required. It also returns correct results when we encounter multisets.

There is a faster solution that I am teasing out now, but I'm saving it for later. Bear with me.

As mentioned in the comments, one approach that seems viable is to combine all of the pools and simply generate combinations of this combined pool, choosing the number of pools. You need a tool capable of generating combinations of a multiset, and there is one available in python that I know of: it's in the sympy library, from sympy.utilities.iterables import multiset_combinations. The problem with this is that we still produce duplicate values, and worse, we produce results that are impossible to obtain with an analogous set-and-product combo. For example, if we were to do something like sort and combine all of the pools from the OP and apply the following:

from sympy.utilities.iterables import multiset_combinations
list(multiset_combinations([1, 2, 2, 3, 3, 3, 4, 4, 5], 3))

a couple of the results would be [1, 2, 2] and [4, 4, 5], both of which are impossible to obtain from [[1, 2, 3], [2, 3, 4], [3, 4, 5]].

Outside of special cases, I don't see how to avoid checking every possible product. I hope I am wrong.

Algorithm Overview
The main idea is to map the combinations of our vector product to unique combinations, without having to filter out duplicates. The example given by the OP (i.e. (1, 2, 3) and (1, 3, 2)) should map to just one value (either one of them, since order doesn't matter). We note that those two vectors are identical as sets. Now, we also have situations like:

vec1 = (1, 2, 1)
vec2 = (2, 1, 1)
vec3 = (2, 2, 1)

We need vec1 and vec2 to map to the same value, whereas vec3 needs to map to its own value. This is the problem with sets, since all of these are equivalent as sets (with sets, elements are unique, thus {a, b, b} and {a, b} are equivalent).

This is where multisets come into play. With multisets, (2, 2, 1) and (1, 2, 1) are distinct, but (1, 2, 1) and (2, 1, 1) are the same. This is good: we now have a way of generating unique keys.

As I am not a python programmer, I will proceed in C++.

We have some issues if we try to implement everything above exactly as is. As far as I know, you can't have a std::multiset<int> as the key of a std::unordered_map. However, we can with a regular std::map. It isn't as performant as a hash table underneath (it's actually a red-black tree), but it still gives decent performance. Here it is:

void cartestionCombos(std::vector<std::vector<int> > v, bool verbose) {

    std::map<std::multiset<int>, std::vector<int> > cartCombs;

    unsigned long int len = v.size();
    unsigned long int myProd = 1;
    std::vector<unsigned long int> s(len);

    for (std::size_t j = 0; j < len; ++j) {
        myProd *= v[j].size();
        s[j] = v[j].size() - 1;
    }

    unsigned long int loopLim = myProd - 1;
    std::vector<std::vector<int> > res(myProd, std::vector<int>());
    std::vector<unsigned long int> myCounter(len, 0);
    std::vector<int> value(len, 0);
    std::multiset<int> key;

    for (std::size_t j = 0; j < loopLim; ++j) {
        key.clear();

        for (std::size_t k = 0; k < len; ++k) {
            value[k] = v[k][myCounter[k]];
            key.insert(value[k]);
        }

        cartCombs.insert({key, value});

        int test = 0;
        while (myCounter[test] == s[test]) {
            myCounter[test] = 0;
            ++test;
        }

        ++myCounter[test];
    }

    key.clear();
    // Get last possible combination
    for (std::size_t k = 0; k < len; ++k) {
        value[k] = v[k][myCounter[k]];
        key.insert(value[k]);
    }

    cartCombs.insert({key, value});

    if (verbose) {
        int count = 1;

        for (std::pair<std::multiset<int>, std::vector<int> > element : cartCombs) {
            std::string tempStr;

            for (std::size_t k = 0; k < len; ++k)
                tempStr += std::to_string(element.second[k]) + ' ';

            std::cout << count << " : " << tempStr << std::endl;
            ++count;
        }
    }
}

For a test case of 8 vectors with lengths 4 to 8, filled with random integers from 1 to 15, the above algorithm runs in about 5 seconds on my computer. That's not bad considering we are looking at nearly 2.5 million total products, but we can do better. How?

The best performance is obtained from a std::unordered_map with a key that is built in constant time. Our key above is built in logarithmic time (multiset, map, and hash-map complexity). So the question is, how can we overcome these hurdles?

Best performance

We know we must abandon std::multiset. We need some kind of object that has a commutative-type property while also giving unique results.

Enter the Fundamental Theorem of Arithmetic

It states that every number can be uniquely represented (up to the order of the factors) as a product of primes. This is sometimes called the prime factorization.

So now, we can simply proceed as before, but instead of constructing a multiset, we map each index to a prime number and multiply the results. This gives a constant-time construction of our key. Here is an example showing the power of this technique on the examples we created above (N.B. P below is the list of primes... (2, 3, 5, 7, 11, etc.)):

                   Maps to                    Maps to            product
vec1 = (1, 2, 1)    -->>    P[1], P[2], P[1]   --->>   3, 5, 3    -->>    45
vec2 = (2, 1, 1)    -->>    P[2], P[1], P[1]   --->>   5, 3, 3    -->>    45
vec3 = (2, 2, 1)    -->>    P[2], P[2], P[1]   --->>   5, 5, 3    -->>    75

This is awesome!! vec1 and vec2 map to the same number, whereas vec3 maps to a different value, just as we hoped.

void cartestionCombosPrimes(std::vector<std::vector<int> > v, 
                        std::vector<int> primes,
                        bool verbose) {

    std::unordered_map<int64_t, std::vector<int> > cartCombs;

    unsigned long int len = v.size();
    unsigned long int myProd = 1;
    std::vector<unsigned long int> s(len);

    for (std::size_t j = 0; j < len; ++j) {
        myProd *= v[j].size();
        s[j] = v[j].size() - 1;
    }

    unsigned long int loopLim = myProd - 1;
    std::vector<std::vector<int> > res(myProd, std::vector<int>());
    std::vector<unsigned long int> myCounter(len, 0);
    std::vector<int> value(len, 0);
    int64_t key;

    for (std::size_t j = 0; j < loopLim; ++j) {
        key = 1;

        for (std::size_t k = 0; k < len; ++k) {
            value[k] = v[k][myCounter[k]];
            key *= primes[value[k]];
        }

        cartCombs.insert({key, value});

        int test = 0;
        while (myCounter[test] == s[test]) {
            myCounter[test] = 0;
            ++test;
        }

        ++myCounter[test];
    }

    key = 1;
    // Get last possible combination
    for (std::size_t k = 0; k < len; ++k) {
        value[k] = v[k][myCounter[k]];
        key *= primes[value[k]];
    }

    cartCombs.insert({key, value});
    std::cout << cartCombs.size() << std::endl;

    if (verbose) {
        int count = 1;

        for (std::pair<int, std::vector<int> > element : cartCombs) {
            std::string tempStr;

            for (std::size_t k = 0; k < len; ++k)
                tempStr += std::to_string(element.second[k]) + ' ';

            std::cout << count << " : " << tempStr << std::endl;
            ++count;
        }
    }
}

On the same example above, which generates nearly 2.5 million products, the above algorithm returns the same result in less than 0.3 seconds.

There are a couple of caveats with this latter method. We must have our primes generated a priori, and if we have many vectors in our Cartesian product, the key could exceed the bounds of an int64_t. The first issue shouldn't be that difficult to overcome, since many resources (libraries, lookup tables, etc.) are available for generating primes. I'm not really sure, but I've read that the latter issue shouldn't be a problem for python, since its integers have arbitrary precision.

We also have to deal with the fact that our source vectors might not be nice integer vectors with small values. This can be remedied by ranking all of the elements across all vectors before you proceed. For example, given the following vectors:

vec1 = (12345.65, 5, 5432.11111)
vec2 = (2222.22, 0.000005, 5)
vec3 = (5, 0.5, 0.8)

Ranking them, we would get:

rank1 = (6, 3, 5)
rank2 = (4, 0, 3)
rank3 = (3, 1, 2)

Now, these can be used in place of the actual values to create the keys. The only part of the code that changes is the for loop that builds the key (and of course the rank object that would need to be created):

for (std::size_t k = 0; k < len; ++k) {
    value[k] = v[k][myCounter[k]];
    key *= primes[rank[k][myCounter[k]]];
}
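For Pythonistas, here's a rough sketch of the same idea (rank the values, then build prime-product keys) in Python. It's my own illustration, not a line-for-line translation of the C++ above, and it still has to walk the full product - the primes only make the de-duplication key cheap to build and insensitive to order. sympy's prime() is assumed for convenience:

from itertools import product
from sympy import prime

def prime_key_combos(pools):
    # Rank every distinct value across all pools.
    ranks = {v: r for r, v in enumerate(sorted({x for p in pools for x in p}))}
    # One prime per rank: 2, 3, 5, 7, ...
    primes = [prime(i + 1) for i in range(len(ranks))]
    seen = {}
    for t in product(*pools):
        key = 1
        for x in t:
            key *= primes[ranks[x]]  # multiplication commutes, so order can't matter
        seen.setdefault(key, t)      # keep one representative per multiset
    return list(seen.values())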

Edit:
As some of the commenters have pointed out, the methods above gloss over the fact that all of the products must still be generated. I should have said that the first time around. Personally, given the many different presentations, I don't see how it can be avoided.

Also, in case anybody is curious, here is the test case I used above:

[1 10 14  6],
[7  2  4  8  3 11 12],
[11  3 13  4 15  8  6  5],
[10  1  3  2  9  5  7],
[1  5 10  3  8 14],
[15  3  7 10  4  5  8  6],
[14  9 11 15],
[7  6 13 14 10 11  9  4]

It should return 162295 unique combinations.


user2357112 supports Monica 7

One way to save some work might be to generate de-duplicated combinations of the first k chosen pools, then extend those to de-duplicated combinations of the first k+1. This lets you avoid individually generating and rejecting all length-20 combinations that picked 2, 1 instead of 1, 2 from the first two pools:

def combinations_from_pools(pools):
    # 1-element set whose one element is an empty tuple.
    # With no built-in hashable multiset type, sorted tuples are probably the most efficient
    # multiset representation.
    combos = {()}
    for pool in pools:
        combos = {tuple(sorted(combo + (elem,))) for combo in combos for elem in pool}
    return combos

However, depending on the input sizes you're talking about, no matter how efficiently you generate the combinations, you won't be able to process them all. Even with 20 identical 1000-element pools, there would be 496432432489450355564471512635900731810050 combinations (1019 choose 20, by the stars-and-bars formula), or about 5e41. If you conquered the Earth and devoted all the processing power of all the computing devices of all humanity to the task, you still couldn't make a dent in that. You need to find a better way to solve your underlying task.
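
(The count is straightforward to reproduce, e.g. with Python 3.8+:)

>>> from math import comb
>>> comb(1019, 20)  # stars and bars: C(1000 + 20 - 1, 20)
496432432489450355564471512635900731810050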


Leon 5

The answers posted so far (including Tim Peters' lazy lexicographic one-at-a-time generation) have worst-case space complexity proportional to the size of the output. I'll outline an approach that constructively generates all unique unordered combinations, without de-duplicating internally generated intermediate data. My algorithm generates the combinations in lexicographically sorted order. It has a computational overhead compared to the simpler algorithms, but it can be parallelized (so that different ranges of the final output can be produced concurrently).

The idea is the following.

So we have N pools {P1, ..., PN} from which we must draw our combinations. We can easily identify the smallest combination (with respect to the aforementioned lexicographic ordering). Let it be (x1, x2, ..., xN-1, xN) (where x1 <= x2 <= ... <= xN-1 <= xN, and each xj is simply the smallest element from one of the pools {Pi}). This smallest combination will be followed by zero or more combinations where the prefix x1, x2, ..., xN-1 is the same and the last position runs over an increasing sequence of values. How can we identify that sequence?

Let's introduce the following definition:

Given a combination prefix C = (x1, x2, ..., xK-1, xK) (where K < N), a pool Pi is called free with respect to C if the latter (the prefix) can be drawn from the other pools.

Identifying the free pools for a given prefix easily reduces to the problem of finding maximum matchings in a bipartite graph. The challenging part is doing it efficiently (taking advantage of the specifics of our case). But I'll save that for later (this is work in progress, to be materialized as a Python program in a day or so).

So, for the prefix (x1, x2, ..., xN-1) of our first combination we can identify all the free pools {FPi}. Any of them can be used to pick an element for the last position. Therefore, the sequence of interest is the sorted set of elements from {FP1 U FP2 U ...} that are greater than or equal to xN-1.

When the last position is exhausted, we must increment the next-to-last position, after which we repeat the process of finding the possible values for the last position. Unsurprisingly, the procedure for enumerating the values for the next-to-last (as well as any other) position is the same - the only difference is the length of the combination prefix, based on which the free pools must be identified.

Therefore, the following recursive algorithm should do the job (a rough Python sketch follows the list):

  1. Start with an empty combination prefix C. At this point all pools are free.
  2. If the length of C equals N, output C and return.
  3. Merge the free pools into one sorted list S and remove from it all elements that are smaller than the last element of C.
  4. For each value x from S do
    • The new combination prefix is C' = (C, x)
    • With the current combination prefix grown by one, some free pools stop being free. Identify them and recurse (re-entering at step 2) with the updated list of free pools and the combination prefix C'.
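
Here is a rough, unoptimized Python sketch of that recursion (my own illustration, not @Leon's promised program). It assumes each pool is drawn from exactly once (i.e. ixs has already been expanded into pools), and it re-runs a from-scratch bipartite matching to identify the free pools at every step - which is exactly the part that needs to be done cleverly:

def lexgen(pools):
    pools = [set(p) for p in pools]
    n = len(pools)

    def can_match(prefix, banned):
        # Can prefix be drawn from the pools without using pool `banned`?
        # Standard augmenting-path bipartite matching, rebuilt from scratch.
        match = {}  # pool index -> position in prefix

        def extend(pos, seen):
            for pi in range(n):
                if pi != banned and pi not in seen and prefix[pos] in pools[pi]:
                    seen.add(pi)
                    if pi not in match or extend(match[pi], seen):
                        match[pi] = pos
                        return True
            return False

        return all(extend(pos, set()) for pos in range(len(prefix)))

    def inner(prefix):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        free = [i for i in range(n) if can_match(prefix, i)]
        last = prefix[-1] if prefix else None
        cands = {x for i in free for x in pools[i]
                 if last is None or x >= last}
        for x in sorted(cands):
            yield from inner(prefix + [x])

    return inner([])

For example, list(lexgen([[1, 2], [2, 3], [1, 3]])) yields the 7 unique triples of the troublesome 3-input example from above, in lexicographic order, with no duplicates generated along the way.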

  • OK, I think I grasp this now - cool! It seems clear that it would work, and probably with minimal memory burden. How fast it would be is unclear to me. Just thinking to myself, I found it easier to grasp when replacing the notion of P being "free" with respect to C by "fixed", where a pool P is fixed for C if P is part of every maximal matching. That is, P _must_ be used to construct C (e.g., P is the only pool containing some element of C). The set of fixed pools is just the complement of the set of free pools. (2 upvotes)
  • @JosephWood, adding parallelism would be close to trivial. For example, if `U` is the union of all the pools, give each of `len(U)` processes a singleton prefix containing one of `U`'s elements, and change the code so it doesn't look for alternatives for the first tuple position. There are any number of ways to slice up the work ;-) Finding a good way may be hard, though, because we have no efficient way to compute in advance how _many_ result tuples lie within a given range of tuples. (2 upvotes)

小智 3

You could implement a hashable list and use python's set() to filter out all duplicates. Your hash function just needs to ignore the order of the list, which can be achieved with collections.Counter:

from collections import Counter

class HashableList(list):
    def __hash__(self):
        # A frozenset of (element, count) pairs ignores order but keeps multiplicity.
        return hash(frozenset(Counter(self).items()))
    def __eq__(self, other):
        return Counter(self) == Counter(other)

x = HashableList([1, 2, 3])
y = HashableList([3, 2, 1])

print(set([x, y]))

This returns:

{[1, 2, 3]}

  • I don't see the point of the subclass. Just do `set(map(lambda x: frozenset(Counter(x).items()), the_sequence))`... Subclassing built-ins is practically *never* the right answer. They don't behave the way most people expect, and here you're not really achieving anything beyond changing `map(lambda x: frozenset(Counter(x).items()), ...)` to `map(HashableList, ...)`. (3 upvotes)