My goal is to find the smallest possible number of the subsets [a-f] whose union makes up the full set A.
A = set([1,2,3,4,5,6,7,8,9,10]) # full set
# --- below are subsets of A ---
a = set([1,2])
b = set([1,2,3])
c = set([1,2,3,4])
d = set([4,5,6,7])
e = set([7,8,9])
f = set([5,8,9,10])
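For this toy input, at least three subsets are needed (the largest subsets have four elements, so any two of them cover at most eight of the ten), and three suffice. For example:

c | d | f == A  # {1,2,3,4} | {4,5,6,7} | {5,8,9,10} -> True

So the answer here would be a cover of size 3, such as [c, d, f].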
In reality, the parent set A I'm working with contains 15k unique elements, and there are 30k subsets, ranging in length from a single unique element to 1.5k unique elements.
The code I'm using at the moment looks more or less like the following (Python 2), and it's slow:
import random

B = {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f}
Bx = B.keys()
random.shuffle(Bx)
Dict = {}
for i in Bx:  # iterate through shuffled keys.
    z = [i]        # keys of the subsets used so far
    x = B[i]       # union of the subsets used so far
    L = len(x)
    while L < len(A):
        for ii in Bx:
            x = x | B[ii]
            Lx = len(x)
            if Lx > L:  # B[ii] added new elements, so keep it
                L = Lx
                z.append(ii)
    try:
        Dict[len(z)].append(z)
    except KeyError:
        Dict[len(z)] = [z]
print Dict[min(Dict.keys())]  # the smallest covers found
This is just to illustrate the approach I've taken. For clarity I've omitted some logic that minimizes iterating over sets that are already too large, and similar things.
I imagine Numpy would be really good at this type of problem, but I can't think of a way to use it.
The question asks for an implementation of the set cover problem, for which no fast algorithm exists to find the optimal solution. However, the greedy solution to the problem (repeatedly pick the subset that contains the most elements not yet covered) does a good job in a reasonable amount of time.
You can find a Python implementation of that algorithm in a previous question.
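For reference, here is a minimal sketch of that greedy approach (the name greedy_set_cover_basic and the details are my own illustration, not the code from that question):

def greedy_set_cover_basic(subsets, parent_set):
    """Repeatedly pick the subset that covers the most uncovered elements."""
    parent_set = set(parent_set)
    subsets = [set(s) for s in subsets]
    covered = set()
    cover = []
    while covered < parent_set:
        # score each subset by how many new elements it would add
        best = max(subsets, key=lambda s: len(s - covered))
        if not (best - covered):
            break  # remaining elements appear in no subset; give up
        cover.append(best)
        covered |= best
    return cover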
Edited to add: @Aaron Hall's answer can be improved by using the following drop-in replacement for his greedy_set_cover routine. In Aaron's code, we calculate a score len(s - result_set) for every remaining subset, each time we want to add a subset to the cover. However, this score can only decrease as result_set grows; so if in the current iteration we have picked out a best-so-far set with a score higher than what the remaining subsets achieved in previous iterations, we know their scores cannot improve and we can ignore them. This suggests using a priority queue to store the subsets for processing; in Python we can implement the idea with heapq:
# at top of file
import heapq
#... etc
# replace greedy_set_cover
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    max = len(parent_set)
    # create the initial heap. Note 'subsets' can be unsorted,
    # so this is independent of whether remove_redundant_subsets is used.
    heap = []
    for s in subsets:
        # Python's heapq lets you pop the *smallest* value, so we
        # want to use max-len(s) as a score, not len(s).
        # len(heap) is just providing a unique number to each subset,
        # used to tiebreak equal scores.
        heapq.heappush(heap, [max-len(s), len(heap), s])
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        best = []
        unused = []
        while heap:
            score, count, s = heapq.heappop(heap)
            if not best:
                best = [max-len(s - result_set), count, s]
                continue
            if score >= best[0]:
                # because subset scores only get worse as the result_set
                # gets bigger, we know that the rest of the heap cannot beat
                # the best score. So push the subset back on the heap, and
                # stop this iteration.
                heapq.heappush(heap, [score, count, s])
                break
            score = max-len(s - result_set)
            if score >= best[0]:
                unused.append([score, count, s])
            else:
                unused.append(best)
                best = [score, count, s]
        add_set = best[2]
        logging.debug('len of add_set is {0} score was {1}'.format(len(add_set), best[0]))
        results.append(add_set)
        result_set.update(add_set)
        # subsets that were not the best get put back on the heap for next time.
        while unused:
            heapq.heappush(heap, unused.pop())
    return results
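To try the drop-in on the toy data from the question, a minimal harness could look like the following (the no-op timer here is just a stand-in for the @timer decorator from Aaron Hall's answer, which the code above assumes is already defined):

import heapq
import logging

logging.basicConfig(level=logging.DEBUG)

def timer(fn):  # stand-in for the @timer decorator in Aaron Hall's answer
    return fn

# ... paste greedy_set_cover from above here ...

A = set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
subsets = [set([1, 2]), set([1, 2, 3]), set([1, 2, 3, 4]),
           set([4, 5, 6, 7]), set([7, 8, 9]), set([5, 8, 9, 10])]
cover = greedy_set_cover(subsets, A)
print(len(cover))  # -> 3 on this input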
For comparison, here are the timings for Aaron's code on my laptop. I dropped remove_redundant_subsets, since when we use the heap, dominated subsets are never reprocessed anyway:
INFO:root:make_subsets function took 15800.697 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 463.478 ms
INFO:root:greedy_set_cover function took 32662.359 ms
INFO:root:len of results is 46
And here are the timings for the code above; a little over 3x faster.
INFO:root:make_subsets function took 15674.409 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 461.027 ms
INFO:root:greedy_pq_set_cover function took 8896.885 ms
INFO:root:len of results is 46
Note: the two algorithms process the subsets in different orders, and will occasionally give different answers for the size of the set cover; that is just down to "lucky" choices of subsets when scores are tied.
The priority queue/heap is a well-known optimization of the greedy algorithm, although I couldn't find a decent discussion of it to link to.
While the greedy algorithm is a quick way to get an approximate answer, you can improve on that answer by spending some time afterwards, since we now have an upper bound on the minimal set cover. Techniques for doing so include simulated annealing and branch-and-bound algorithms, as illustrated in Peter Norvig's article.
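To give a flavor of the branch-and-bound idea, here is a naive sketch of my own (not the algorithm from Norvig's article): seed the search with the greedy cover as the incumbent best, branch on an uncovered element (any valid cover must include some subset containing it), and prune any branch that can no longer beat the incumbent:

def branch_and_bound_cover(subsets, parent_set, initial_cover):
    """Exhaustive search seeded with a known cover, e.g. the greedy result.
    Exponential in the worst case; only practical for small inputs."""
    subsets = [set(s) for s in subsets]

    def search(uncovered, chosen, best):
        if not uncovered:
            # found a complete cover; keep it if strictly smaller
            return chosen if len(chosen) < len(best) else best
        # prune: we need at least one more subset, which cannot beat best
        if len(chosen) + 1 >= len(best):
            return best
        # branch on one uncovered element: any valid cover must include
        # some subset that contains it
        elem = next(iter(uncovered))
        for s in subsets:
            if elem in s:
                best = search(uncovered - s, chosen + [s], best)
        return best

    return search(set(parent_set), [], list(initial_cover))

Called as branch_and_bound_cover(subsets, A, greedy_set_cover(subsets, A)), this never returns a cover larger than the greedy one, but the search is exponential and only practical far below the 15k-element scale in the question.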