如何使此列表功能更快?

use*_*364 7 python algorithm optimization dictionary list

def removeDuplicatesFromList(seq): 
    # Not order preserving 
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()

def countWordDistances(li):
    '''
    If li = ['that','sank','into','the','ocean']    
    This function would return: { that:1, sank:2, into:3, the:4, ocean:5 }
    However, if there is a duplicate term, take the average of their positions
    '''
    wordmap = {}
    unique_words = removeDuplicatesFromList(li)
    for w in unique_words:
        distances = [i+1 for i,x in enumerate(li) if x == w]
        wordmap[w] = float(sum(distances)) / float(len(distances)) #take average
    return wordmap
Run Code Online (Sandbox Code Playgroud)

如何更快地完成此功能?

Ned*_*der 15

import collections
def countWordDistances(li):
    wordmap = collections.defaultdict(list)
    for i, w in enumerate(li, 1):
        wordmap[w].append(i)
    for k, v in wordmap.iteritems():
        wordmap[k] = sum(v)/float(len(v))

    return wordmap
Run Code Online (Sandbox Code Playgroud)

这使得只有一个通过列表,并将操作保持在最低限度.我在一个包含1.1M条目,29k个独特单词的单词列表上计时,它几乎是Patrick的答案的两倍.在10k字的列表中,2k是唯一的,它比OP的代码快300倍.

为了使Python代码更快,需要记住两个规则:使用最佳算法,并避免使用Python.

在算法前端,迭代列表一次而不是N + 1次(N =唯一字的数量)是加速这一点的主要因素.

在"避免Python"方面,我的意思是:你希望你的代码尽可能在C中执行.所以使用defaultdict比明确检查密钥是否存在的字典更好. defaultdict这会检查你,但在C实现中,它在Python实现中. enumerate更好for i in range(len(li)),因为它的Python步骤更少.并且enumerate(li, 1)使计从1开始,而不必在环路中一个Python +1地方.

编辑:第三条规则:使用PyPy.我的代码在PyPy上的速度是2.7的两倍.


Nei*_*l G 5

基于@Ned Batchelder的解决方案,但没有创建虚拟列表:

import collections
def countWordDistances(li):
    wordmap = collections.defaultdict(lambda:[0.0, 0.0])
    for i, w in enumerate(li, 1):
        wordmap[w][0] += i
        wordmap[w][1] += 1.0
    for k, (t, n) in wordmap.iteritems():
        wordmap[k] = t / n
    return wordmap
Run Code Online (Sandbox Code Playgroud)


One*_*One 1

使用一套:

def countWordDistances(li):
    '''
    If li = ['that','sank','into','the','ocean']    
    This function would return: { that:1, sank:2, into:3, the:4, ocean:5 }
    However, if there is a duplicate term, take the average of their positions
    '''
    wordmap = {}
    unique_words = set(li)
    for w in unique_words:
        distances = [i+1 for i,x in enumerate(li) if x == w]
        wordmap[w] = float(sum(distances)) / float(len(distances)) #take average
    return wordmap
Run Code Online (Sandbox Code Playgroud)