Optimizing how to find duplicate values and their indexes in a list in Python

Gui*_*ain 3 python optimization performance list duplicates

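I have a list of 18,000 unique IDs. Each ID is a concatenation of the letters A, B, C, D. I wrote code that groups the IDs by ID[0:-1] (the ID without its last character) and gives the index positions of the IDs that share that prefix.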

It works well, but it takes a very long time: around 110 secs for 18,000 IDs. Do you have any ideas for speeding up my code?

import time

a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']

startTime = time.time()
b = [i[0:-1] for i in a]
b = list(set(b))


result = range(len(b))  # Python 2: range() returns a list, so it can be filled in place
for i in result:
    result[i] = [b[i], []]
    for j in xrange(len(a)):
        if b[i] == a[j][0:-1]:
            result[i][1].append(j)

endTime =  time.time()

print endTime - startTime, 'secs !'

Output:

>>> [['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]], ['1BCABCCCA', [3, 5]]]

Jun*_*sor 5

This is exactly what groupby in Python is for:

from itertools import groupby
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]

Output:

[['1BCABCCCA', [3, 5]], ['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]]]
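For readers on Python 3, here is a minimal sketch of the same idea wrapped in a function. The name `group_with_groupby` is made up for illustration and is not part of the original answer. The point it tries to show is that `groupby` only merges neighbouring items with equal keys, which is why the positions are sorted by the same key first.

```python
from itertools import groupby


def group_with_groupby(ids):
    """Group list positions by each ID minus its last character."""
    def prefix(i):
        return ids[i][:-1]

    # groupby only merges adjacent items, so sort the positions by prefix first.
    # sorted() is stable, so positions stay in ascending order inside each group.
    order = sorted(range(len(ids)), key=prefix)
    return [[k, list(g)] for k, g in groupby(order, key=prefix)]


a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
result = group_with_groupby(a)
print(result)
# [['1BCABCCCA', [3, 5]], ['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]]]
```

Because of the sort, the groups come back ordered by prefix rather than by first appearance in the list.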


Kas*_*mvd 5

A more Pythonic way to solve this kind of problem is to use collections.defaultdict:

>>> from collections import defaultdict
>>> d=defaultdict(list)
>>> new=[i[:-1] for i in a]
>>> for i,j in enumerate(new):
...    d[j].append(i)
... 
>>> d
defaultdict(<type 'list'>, {'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]})
>>> d.items()
[('1CDABCABD', [0, 1, 2]), ('1DDAABBBB', [4]), ('1BCABCCCA', [3, 5])]

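Note that the defaultdict approach is a linear, O(n), solution: it makes a single pass over the list. That makes it more efficient than itertools.groupby combined with sorted, where the sort alone costs O(n log n).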
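As a rough Python 3 sketch of that single pass, the following wraps the idea in a function and returns the `[prefix, [indexes]]` shape used in the question. The function name `group_indexes_by_prefix` is a hypothetical helper, not something from the original answer.

```python
from collections import defaultdict


def group_indexes_by_prefix(ids):
    """Map each ID minus its last character to the positions where it occurs."""
    groups = defaultdict(list)
    # One pass over the list: each ID is sliced once and its index appended once.
    for index, item in enumerate(ids):
        groups[item[:-1]].append(index)
    # Convert to the list-of-pairs layout shown in the question.
    return [[prefix, indexes] for prefix, indexes in groups.items()]


a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
grouped = group_indexes_by_prefix(a)
print(grouped)
# On Python 3.7+, where dicts keep insertion order:
# [['1CDABCABD', [0, 1, 2]], ['1BCABCCCA', [3, 5]], ['1DDAABBBB', [4]]]
```

Unlike the sorted version, the groups here appear in the order their prefix is first seen in the list.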

You can also use the dict.setdefault method:

>>> d={}
>>> for i,j in enumerate(new):
...   d.setdefault(j,[]).append(i)
... 
>>> d
{'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]}

For more detail, see the following benchmark, in which the dict-based version is about 4 times faster:

from timeit import timeit

s1="""
from itertools import groupby
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]
"""
s2="""
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
new=[i[:-1] for i in a]
d={}
for i,j in enumerate(new):
   d.setdefault(j,[]).append(i)
d.items()
"""


print ' first: ' ,timeit(stmt=s1, number=100000)
print 'second : ',timeit(stmt=s2, number=100000)

Result:

 first:  0.949549913406
second :  0.250894069672
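The benchmark above times a six-element list, so it mostly measures fixed overhead. A sketch closer to the size in the question might look like the following, written for Python 3. The data here is synthetic: randomly generated IDs shaped like the examples, which is an assumption about the real data, and the function names are made up for illustration. Random generation may produce a few repeated IDs, which does not affect the grouping. No timings are shown because they depend on the machine.

```python
import random
from collections import defaultdict
from itertools import groupby
from timeit import timeit


def by_groupby(ids):
    """Sort positions by prefix, then group neighbours: O(n log n)."""
    key = lambda i: ids[i][:-1]
    order = sorted(range(len(ids)), key=key)
    return {k: list(g) for k, g in groupby(order, key=key)}


def by_defaultdict(ids):
    """Single pass with a dict of lists: O(n)."""
    groups = defaultdict(list)
    for i, item in enumerate(ids):
        groups[item[:-1]].append(i)
    return dict(groups)


random.seed(0)
# Synthetic stand-in for the asker's data: '1' followed by 9 letters from ABCD.
ids = ['1' + ''.join(random.choice('ABCD') for _ in range(9)) for _ in range(18000)]

# Both approaches should produce the same mapping.
assert by_groupby(ids) == by_defaultdict(ids)

print('groupby    :', timeit(lambda: by_groupby(ids), number=20))
print('defaultdict:', timeit(lambda: by_defaultdict(ids), number=20))
```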