用于K个不同的M大小的子集的确定性python生成器

Ero*_*mic 5 python numpy generator

我正在寻找一种从大小为N的集合S生成许多M大小子集的有效方法.

理想情况下,我想生成所有这些,但因为我将它们用于其他计算,这变得不可行.

相反,我想生成K个不同的S子集,使得K个选择的子集最小化K个子集之间的所有成对交叉的大小的总和.

换句话说,如果我有K个子集,我采用所有这些子集的成对交集.然后我将所有这些集合的大小加在一起.我得到的数字尽可能低.

基本上,我希望这些子集彼此"远离"是可能的.我一直在想我将如何做到这一点,但我画了一个空白.

在此期间模拟它我写了这个函数

        def subset_split(full_set, M, K):
            np.random.seed(0) # repeatibility
            seen = set([])
            subset_list = []
            for kx in xrange(K):
                np.random.shuffle(full_set)
                failsafe = 0
                while True: 
                    np.random.shuffle(full_set)
                    subset = tuple(full_set[0:M])
                    if not subset in seen: 
                        seen.add(subset)
                        subset_list.append(subset)
                        break
                    failsafe += 1
                    if failsafe > 100:
                        break
            return subset_list
Run Code Online (Sandbox Code Playgroud)

它只生成以前没有见过的K个随机子集.但这并不是我想要的,因为我希望那些K子集是可重复的,并且如果它们不必如此,则不会意外地接近每个子集.

Tim*_*ers 7

好吧,还在玩这个;-)

在这种情况下,很明显,我们试图尽量减少:

sum(n*(n-1)//2 for n in index2count.values())
Run Code Online (Sandbox Code Playgroud)

如果所有值都相同(如果可能的话),那么这是最小的,或者相隔两个不同的值(例如,如果len(full_set) = 10,7个3和3个4).这足以编写一个根本无需计算的生成器index2count.而是hi具有两个值中较大值的索引集,以及lo其余索引(lo如果所有(概念上的!未计算的)值都相同,则为空).

这会删除K参数,并对重复采用不同的方法:它会忽略它们.在这里跟踪重复是昂贵和笨拙的,并且在"随机 - 发送"生成器中应该真正期待重复.如果这困扰你,请将其包装在另一个例程中以清除重复项.

样本输出:

>>> from itertools import islice
>>> for s in islice(gen_subsets_special(set(range(5)), 1), 10):
...    print s
set([4])
set([3])
set([0])
set([1])
set([2])
set([0])
set([3])
set([1])
set([2])
set([4])

>>> for s in islice(gen_subsets_special(set(range(10, 20)), 3), 10):
...    print s
set([17, 18, 10])
set([11, 12, 14])
set([16, 19, 13])
set([12, 13, 15])
set([17, 10, 11])
set([16, 19, 15])
set([17, 18, 14])
set([16, 18, 13])
set([19, 12, 15])
set([10, 11, 14])
Run Code Online (Sandbox Code Playgroud)

这是代码:

def gen_subsets_special(full_set, M, seed=123456):
    # generate randomish M-subsets of full_set, "far apart".
    import random
    from random import sample
    random.seed(seed)
    elements = list(full_set)
    N = len(elements)
    hi = set(range(N))
    lo = set()
    while True:
        assert not (hi & lo)
        assert len(lo | hi) == N
        # First take indices from hi, then (if needed) from lo.
        if len(hi) > M:
            # We can take all of them from hi, with some left over.
            ixhi = set(sample(hi, M))
            ixlo = set()
            # The ixhi counts go down by 1, so move 'em to lo
            hi -= ixhi
            lo |= ixhi
            assert hi
        else:
            # We need to take everything in hi.
            ixhi = hi.copy()
            ixlo = set(sample(lo, M - len(ixhi)))
            hi |= lo - ixlo
            lo = ixlo
        assert not (ixlo & ixhi)
        ix = ixlo | ixhi
        assert len(ix) == M
        yield set(elements[i] for i in ix)
Run Code Online (Sandbox Code Playgroud)

通过构造,所生成的序列的每个前缀最小化前缀中所有对集合的交叉点的大小的总和.或者现在在我看来;-)

最后,这种变化更明显地反复循环通过所有指数.这可能是我实际使用的那个:

def gen_subsets_special(full_set, M, seed=123456):
    # generate randomish M-subsets of full_set, "far apart".
    import random
    from random import sample
    random.seed(seed)
    elements = list(full_set)
    allix = set(range(len(elements)))
    takefrom = allix.copy()

    def destructive_sample(n):
        # Remove a random n-subset from takefrom, & return it.
        s = set(sample(takefrom, n))
        takefrom.difference_update(s)
        return s

    while True:
        if len(takefrom) >= M:
            # Get everything from takefrom.
            ix = destructive_sample(M)
        else:
            # We need to take all of takefrom, and more.
            ix = takefrom
            takefrom = allix - ix
            ix |= destructive_sample(M - len(ix))
        assert len(ix) == M
        yield set(elements[i] for i in ix)
Run Code Online (Sandbox Code Playgroud)

  • 是的,那是我 - 我已经老了,但对这个网站来说是新手,所以"upvote"我的答案 - 大声笑;-)希望代码有用!如果我能澄清任何事情,请问. (5认同)