检查python列表/ numpy ndarray中是否存在重复项的最快方法

Question

检查python列表/ numpy ndarray中是否存在重复项的最快方法

Sco*_*nte 3 python arrays numpy list unique

我想确定我的列表（实际上是numpy.ndarray）是否包含最快重复执行时间的重复项。请注意，我不在乎删除重复项，我只是想知道是否有重复项。

注意：如果这不是重复的内容，我会感到非常惊讶，但是我已经尽力而为，找不到了。这个问题和这个问题最接近，这两个问题都要求返回唯一列表。

Answer 1

Sco*_*nte 5

这是我想到的四种方法。

TL; DR：如果您期望很少（少于1/1000）重复：

def contains_duplicates(X):
    return len(np.unique(X)) != len(X)

Run Code Online (Sandbox Code Playgroud)

如果您希望频繁（超过1/1000）重复，请执行以下操作：

def contains_duplicates(X):
    seen = set()
    seen_add = seen.add
    for x in X:
        if (x in seen or seen_add(x)):
            return True
    return False

Run Code Online (Sandbox Code Playgroud)

第一种方法是从此答案中提早退出，希望返回唯一值，第二种方法是与该答案相同的想法。

>>> import numpy as np
>>> X = np.random.normal(0,1,[10000])
>>> def terhorst_early_exit(X):
...:     elems = set()
...:     for i in X:
...:         if i in elems:
...:             return True
...:         elems.add(i)
...:     return False
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 10.6 ms per loop
>>> def peterbe_early_exit(X):
...:     seen = set()
...:     seen_add = seen.add
...:     for x in X:
...:         if (x in seen or seen_add(x)):
...:             return True
...:     return False
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 9.35 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 4.54 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 967 µs per loop

Run Code Online (Sandbox Code Playgroud)

如果您以普通的Python列表而不是a开头，情况是否会发生变化numpy.ndarray？

>>> X = X.tolist()
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 9.34 ms per loop
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 8.07 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 1.83 ms per loop

Run Code Online (Sandbox Code Playgroud)

编辑：如果我们对重复的数量有事先的期望怎么办？

上述比较是在以下假设下进行的：a）可能没有重复项，或者b）我们比平均情况更担心最坏的情况。

>>> X = np.random.normal(0, 1, [10000])
>>> for n_duplicates in [1, 10, 100]:
>>>     print("{} duplicates".format(n_duplicates))
>>>     duplicate_idx = np.random.choice(len(X), n_duplicates, replace=False)
>>>     X[duplicate_idx] = 0
>>>     print("terhost_early_exit")
>>>     %timeit terhorst_early_exit(X)
>>>     print("peterbe_early_exit")
>>>     %timeit peterbe_early_exit(X)
>>>     print("set length")
>>>     %timeit len(set(X)) != len(X)
>>>     print("numpy unique length")
>>>     %timeit len(np.unique(X)) != len(X)
1 duplicates
terhost_early_exit
100 loops, best of 3: 12.3 ms per loop
peterbe_early_exit
100 loops, best of 3: 9.55 ms per loop
set length
100 loops, best of 3: 4.71 ms per loop
numpy unique length
1000 loops, best of 3: 1.31 ms per loop
10 duplicates
terhost_early_exit
1000 loops, best of 3: 1.81 ms per loop
peterbe_early_exit
1000 loops, best of 3: 1.47 ms per loop
set length
100 loops, best of 3: 5.44 ms per loop
numpy unique length
1000 loops, best of 3: 1.37 ms per loop
100 duplicates
terhost_early_exit
10000 loops, best of 3: 111 µs per loop
peterbe_early_exit
10000 loops, best of 3: 99 µs per loop
set length
100 loops, best of 3: 5.16 ms per loop
numpy unique length
1000 loops, best of 3: 1.19 ms per loop

Run Code Online (Sandbox Code Playgroud)

因此，如果您期望很少的重复项，那么可以使用该numpy.unique函数。随着预期重复次数的增加，早期退出方法占主导地位。

我不确定这是否是公平的测试。您的数据几乎不可能重复，因此提前退出毫无意义。如果现实生活中的数据集通常有10个重复项，那么提早退出将使它快10倍。 (2认同)

归档时间：	7 年，8 月前
查看次数：	2551 次
最近记录：	6 年，2 月前