有效删除元组列表中的部分重复项

Question

有效删除元组列表中的部分重复项

ksp*_*spr 9 python performance tuples list

我有一个元组列表，列表的长度可以在 ~8 - 1000 之间变化，具体取决于元组的长度。列表中的每个元组都是唯一的。元组的长度为 N，其中每个条目都是一个通用词。

一个示例元组的长度可以是 N (Word 1, Word 2, Word 3, ..., Word N)

对于列表中的任何元组，所述元组中的元素 j 将是''或Word j

一个非常简单的带有字母的例子是

l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), 
     ('', '', '', 'D'), ('A', '', '', 'D'), 
     ('', 'B', '', '')]

Run Code Online (Sandbox Code Playgroud)

每个元组的每个位置要么具有相同的值，要么为空。我想删除所有''在同一位置的另一个元组中具有所有非值的元组。例如，(A,B,'','')包含所有非''值，(A,B,C,'')因此应删除。

filtered_l = [(A,B,C,''),(A,'','',D)]

Run Code Online (Sandbox Code Playgroud)

元组的长度总是相同的（不一定是 4）。元组的长度将在 2-10 之间。

执行此操作的最快方法是什么？

Answer 1

Gre*_*Guy 6

让我们将每个元组概念化为一个二元数组，其中 1 是“包含某物”，2 是“包含空字符串”。由于在每个位置的项目将是相同的，我们并不需要关心什么是在每个位置，只有那东西。

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]
# [3, 7, 8, 9, 2]
# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]
# that it's backwards doesn't really matter, since it's consistent

Run Code Online (Sandbox Code Playgroud)

现在，我们可以遍历该列表并构建一个没有“重复”的新数据结构。由于我们将元组编码为二进制，因此我们可以通过执行按位运算来确定一个重复的，被另一个“包含” - 给定a和b，如果a | b == a，则a必须包含b。

codes = {}
for tup, b in zip(l, l_bin):
    # check if any existing code contains the potential new one
    # in this case, skip adding the new one
    if any(a | b == a for a in codes):
        continue
    # check if the new code contains a potential existing one or more
    # in which case, replace the existing code(s) with the new code
    for a in list(codes):
        if b | a == b:
            codes.pop(a)
    # and finally, add this code to our datastructure
    codes[b] = tup

Run Code Online (Sandbox Code Playgroud)

现在我们可以撤回我们的“过滤”元组列表：

output = list(codes.values())
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]

Run Code Online (Sandbox Code Playgroud)

请注意，(A, B, C, '')既包含(A, B, '', '')和('', B, '', '')，又(A, '', '', D')包含('', '', '', D)，所以这应该是正确的。

从 python 3.8 开始，dict保留插入顺序，因此输出应该与元组最初出现在列表中的顺序相同。

这个解决方案不是非常有效，因为代码的数量可能会堆积起来，但它应该在 O(n) 和 O(n^2) 之间，这取决于最后剩下的唯一代码的数量（并且因为每个元组的长度明显小于的长度l，它应该更接近 O(n) 而不是 O(n^2)。

Answer 2

use*_*729 5

特别是对于这个限制，显而易见的解决方案是将每个元组转换为位掩码，将它们累加到一个计数器数组中，执行子集和转换，然后过滤数组l。

请参阅注释中的详细代码说明。

时间复杂度显然是n + m * 2^m，其中n是元组的数量，m是每个元组的长度。对于n == 1000和m == 10，这显然比快n^2。

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
# assumes that l is not empty. (to access l[0])
# The case where l is empty is trivial to handle.

def tuple_to_mask(tuple_):
    # convert the information whether each value in (tuple_) is empty to a bit mask
    # (1 is empty, 0 is not empty)
    return sum((value == '') << index for index, value in enumerate(tuple_))


count = [0] * (1 << len(l[0]))
for tuple_ in l:
    # tuple_ is a tuple.
    count[tuple_to_mask(tuple_)] += 1

# now count[mask] is the number of tuples in l with that mask

# transform the count array.
for dimension in range(len(l[0])):
    for mask in range(len(count)):
        if mask >> dimension & 1:
            count[mask] += count[mask - (1 << dimension)]

# now count[mask] is the number of tuples in l with a mask (mask_) such that (mask) contains (mask_)
# (i.e. all the bits that are set in mask_ are also set in mask)


filtered_l = [tuple_ for tuple_ in l if count[tuple_to_mask(tuple_)] == 1]
print(filtered_l)

Run Code Online (Sandbox Code Playgroud)

归档时间：	4 年，11 月前
查看次数：	543 次
最近记录：	4 年，10 月前