Is there any way to make collections.Counter (Python 2.7) aware that its input list is sorted?


The problem

I've been playing with different ways (in Python 2.7) of extracting a list of (word, frequency) tuples from a corpus, or list of strings, and comparing their efficiency. As far as I can tell, in the normal case with an unsorted list, the Counter method from the collections module is superior to anything I came up with or found elsewhere, but it doesn't seem to gain much from a pre-sorted list, and I've come up with methods that beat it easily in this special case. So, in short, is there any built-in way to inform Counter that a list is already sorted, to speed it up further?

(The next section is on unsorted lists, where Counter works its magic; you may want to skip towards the end, where it loses its charm when dealing with sorted lists.)

Unsorted input lists

One approach that doesn't work

The naive approach would be to use sorted([(word, corpus.count(word)) for word in set(corpus)]), but that one reliably gets you into runtime problems as soon as your corpus is a few thousand entries long - not surprisingly, since you're running through the entire list of n words m times, where m is the number of unique words.
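
To make the quadratic behaviour concrete, here is a minimal sketch on a made-up toy corpus:

# each corpus.count(word) is itself a full O(n) scan of the list,
# so one call per unique word adds up to O(n*m) total work:
corpus_small = ['the', 'cat', 'sat', 'on', 'the', 'mat']
naive = sorted([(word, corpus_small.count(word)) for word in set(corpus_small)])
# [('cat', 1), ('mat', 1), ('on', 1), ('sat', 1), ('the', 2)]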

Sorting the list + local search

So what I tried to do instead, before I found Counter, was to make sure that all searches are strictly local by first sorting the input list (I also had to remove digits and punctuation marks and convert all entries to lowercase to avoid duplicates like 'foo', 'Foo', and 'foo:').

#Natural Language Toolkit, for access to corpus; any other source for a long text will do, though.
import nltk 

# nltk corpora come as a class of their own, as I understand it presenting to the
# outside as a unique list but underlyingly represented as several lists, with no more
# than one ever loaded into memory at any one time, which is good for memory issues 
# but rather not so for speed so let's disable this special feature by converting it
# back into a conventional list:
corpus = list(nltk.corpus.gutenberg.words()) 

import string
drop = string.punctuation+string.digits  

def wordcount5(corpus, Case=False, lower=False, StrippedSorted=False):
    '''function for extracting word frequencies out of a corpus. Returns an alphabetic list
    of tuples consisting of words contained in the corpus with their frequencies.  
    Default is case-insensitive, but if you need separate entries for upper and lower case 
    spellings of the same words, set option Case=True. If your input list is already sorted
    and stripped of punctuation marks/digits and/or all lower case, you can accelerate the 
    operation by a factor of 5 or so by declaring so through the options "StrippedSorted" and "lower".'''
    # you can ignore the following 6 lines for now, they're only relevant with a pre-processed input list
    if lower or Case:
        if StrippedSorted:
            sortedc = corpus 
        else:    
            sortedc = sorted([word.replace('--',' ').strip(drop)
                   for word in sorted(corpus)])
    # here we sort and purge the input list in the default case:
    else:
        sortedc = sorted([word.lower().replace('--',' ').strip(drop)
                          for word in sorted(corpus)])
    # start iterating over the (sorted) input list:
    scindex = 0
    # create a list:
    freqs = []
    # identify the first token:
    currentword = sortedc[0]
    length = len(sortedc)
    while scindex < length:
        wordcount = 0
        # increment a local counter while the tokens == currentword
        while scindex < length and sortedc[scindex] == currentword:
            scindex += 1
            wordcount += 1
        # store the current word and final score when a) a new word appears or
        # b) the end of the list is reached
        freqs.append((currentword, wordcount))
        # if a): update currentword with the current token
        if scindex < length:
            currentword = sortedc[scindex]
    return freqs

Enter collections.Counter

This did much better, but it is still not as fast as using the Counter class from the collections module, which creates a dictionary of {word: frequency of word} entries (we still have to do the same stripping and lowering, but no sorting):

from collections import Counter
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1
# optionally, if we want to have the same output format as before,
# we can do the following which negligibly adds in runtime:
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
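
As an aside, the same count can be had a little more compactly by feeding the normalized tokens straight into the Counter constructor; a sketch, equivalent to the loop above:

from collections import Counter

# Counter accepts any iterable, so the explicit loop can be folded away:
cnt = Counter(token.lower().strip(drop) for token in corpus)
wordfreqs = sorted(cnt.items())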

On the Gutenberg corpus with appr. 2M entries, the Counter method is roughly 30% faster on my machine (5 seconds instead of 7.2), which is mostly explained by the sorting routine, which eats around 2.1 seconds. (If you don't have and don't want to install the nltk package (Natural Language Toolkit), which gives access to this corpus, any other sufficiently long text appropriately split into a list of strings at word level will show you the same.)

Comparing performance

With my idiosyncratic method of timing, using a tautology (if 1:) as a condition to delay execution, this gives us for the Counter method:

import time
>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in [token.lower().strip(drop) for token in corpus if token not in [" ", ""]]:
...         cnt[word] += 1
...     time.time()-start
...     cntgbfreqs = sorted([(word, cnt[word]) for word in cnt])
...     time.time()-start
... 
4.999882936477661
5.191655874252319

(We see that the last step, formatting the results as a list of tuples, takes up less than 5% of the total time.)

Compared with my function:

>>> if 1:
...     start = time.time()
...     gbfreqs = wordcount5(corpus)
...     time.time()-start
... 
7.261770963668823

Sorted input lists - when Counter 'fails'

As you may have noticed, though, my function allows specifying that the input is already sorted, stripped of punctuational garbage, and converted to lowercase. If we have already created such a converted version of the list for some other operations, using it (and declaring so) can very much speed up the operation of my wordcount5:

>>> sorted_corpus = sorted([token.lower().strip(drop) for token in corpus if token not in [" ", ""]])
>>> if 1:
...     start = time.time()
...     strippedgbfreqs2 = wordcount5(sorted_corpus, lower=True, StrippedSorted=True)
...     time.time()-start
... 
0.9050078392028809

Here, we've reduced the runtime by a factor of appr. 8 by not having to sort the corpus and convert the items. Of course, the latter is also true when feeding Counter with this new list, so expectably it's also a bit faster, but it doesn't seem to take advantage of the fact that the list is sorted, and now takes twice as long as my function, where before it was 30% faster:

>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in sorted_corpus:
...         cnt[word] += 1
...     time.time()-start
...     strippedgbfreqs = [(word, cnt[word]) for word in cnt]
...     time.time()-start
... 
1.9455058574676514
2.0096349716186523

Of course, we can use the same logic I used in wordcount5 - incrementing a local counter until we run into a new word and only then storing the last word with the counter's current state, resetting the counter to 0 for the next word - and merely use Counter as storage, but the inherent efficiency of the Counter method then seems lost, and performance stays within the range of my function's for creating a dictionary, with the extra burden of converting to a list of tuples now looking more troublesome than it used to when we were processing the raw corpus:

>>> def countertest():
...     start = time.time()
...     sortedcnt = Counter()
...     c = 0
...     length = len(sorted_corpus)
...     while c < length:
...         wcount = 0
...         word = sorted_corpus[c]
...         while c < length and sorted_corpus[c] == word:
...             wcount+=1
...             c+=1
...         sortedcnt[word] = wcount
...         if c < length:
...             word = sorted_corpus[c]
...     print time.time()-start
...     result = sorted([(word, sortedcnt[word]) for word in sortedcnt])
...     print time.time()-start
...     return result
... 
>>> strippedbgcnt = countertest()
0.920727014542
1.08029007912

(The similarity of the results is not really surprising, since we are in effect disabling Counter's own methods and abusing it as a store for values obtained with the very same methodology as before.)

So, my question: Is there a more idiomatic way to inform Counter that its input list is already sorted and thus make it keep the current key in memory rather than looking it up anew every time it - predictably - encounters the next token of the same word? In other words, is it possible to improve performance on a pre-sorted list further by combining the inherent efficiency of the Counter/dictionary class with the obvious benefits of a sorted list, or am I already scratching at a hard limit with .9 seconds for counting a list of 2M entries?

There probably isn't a lot of room for improvement - I get times of around .55 seconds when doing the simplest thing I can think of that still requires iterating through the same list and checking each individual value, and .25 for set(corpus) without a count, but maybe there's some itertools magic out there that would help to get closer to those figures?
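
Something along these lines, as a hypothetical sketch of that lower bound (the token 'zzz' is arbitrary):

def touch_all(seq):
    # iterate the whole list and compare each value, but do no dictionary work;
    # the comparison just forces one per-item check
    hits = 0
    for item in seq:
        if item == 'zzz':
            hits += 1
    return hits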

(Note: I'm a relative novice at Python and at programming in general, so excuse me if I've missed something obvious.)

Edit, Dec. 1:

Another thing, besides the sorting itself, that makes all of my methods above slow is the conversion of every single one of the 2M strings to lowercase and the stripping of whatever punctuation or digits they may include. I had tried before to shortcut that by counting the unprocessed strings and only then converting the results and removing duplicates while adding up their counts, but I must have done something wrong, for it made things ever so slightly slower. I therefore reverted to the previous versions, converting everything in the raw corpus, and now can't quite reconstruct what I did there.

If I try it now, I do get an improvement from converting the strings last. I'm still doing it by looping over a list (of results). What I did was write a couple of functions that would, between them, convert the keys in the output of J.F. Sebastian's winning default_dict method (of format [("word", int), ("Word", int), ("word2", int), ...]) into lowercase, strip them of punctuation, and collapse the counts for all keys left identical after that operation (code below). The advantage is that we're now handling a list of around 50k entries as opposed to the >2M in the corpus. This way I'm now at 1.25 seconds on my machine for going from the corpus (as a list) to a case-insensitive word count ignoring punctuation marks, down from about 4.5 with the Counter method and string conversion as a first step. But maybe there's a dictionary-based method for what I'm doing in sum_sorted() as well?

Code:

from collections import defaultdict

def striplast(resultlist, lower_or_Case=False):
    """function for string conversion of the output of any of the `count_words*` methods"""
    if lower_or_Case:
        strippedresult = sorted([(entry[0].strip(drop), entry[1]) for entry in resultlist])
    else:
        strippedresult = sorted([(entry[0].lower().strip(drop), entry[1]) for entry in resultlist])
    strippedresult = sum_sorted(strippedresult)
    return strippedresult

def sum_sorted(inputlist):
    """function for collapsing the counts of entries left identical by striplast()"""
    ilindex = 0
    freqs = []
    currentword = inputlist[0][0]
    length = len(inputlist)
    while ilindex < length:
        wordcount = 0
        while ilindex < length and inputlist[ilindex][0] == currentword:
            wordcount += inputlist[ilindex][1]
            ilindex += 1
        if currentword not in ["", " "]:
            freqs.append((currentword, wordcount))
        if ilindex < length and inputlist[ilindex][0] > currentword:
            currentword = inputlist[ilindex][0]
    return freqs

def count_words_defaultdict2(words, loc=False): 
    """modified version of J.F. Sebastian's winning method, added a final step collapsing
    the counts for words identical except for punctuation and digits and case (for the 
    latter, unless you specify that you're interested in a case-sensitive count by setting
    l(ower_)o(r_)c(ase) to True) by means of striplast()."""
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    if loc:
        return striplast(sorted(d.items()), lower_or_Case=True)
    else:
        return striplast(sorted(d.items()))

I have made some first attempts at using groupby to accomplish the job currently done by sum_sorted() and/or striplast(), but I couldn't quite figure out how to trick it into summing [entry[1]] for a list of entries in the results of count_words, sorted by entry[0]. The closest I got was:

# "i(n)p(ut)list", toylist for testing purposes:

list(groupby(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in  iplist])))

# returns:

[(('a', 1), <itertools._grouper object at 0x1031bb290>), (('a', 2), <itertools._grouper object at 0x1031bb250>), (('a', 3), <itertools._grouper object at 0x1031bb210>), (('a', 5), <itertools._grouper object at 0x1031bb2d0>), (('a', 8), <itertools._grouper object at 0x1031bb310>), (('b', 3), <itertools._grouper object at 0x1031bb350>), (('b', 7), <itertools._grouper object at 0x1031bb390>)]

# So what I used instead for striplast() is based on list comprehension:

list(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in  iplist]))

# returns:

[('a', 1), ('a', 2), ('a', 3), ('a', 5), ('a', 8), ('b', 3), ('b', 7)]
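
For what it's worth, one way to make groupby do the summing is to group on the word alone via a key function and sum the counts inside each group; a minimal sketch on a made-up toy list:

from itertools import groupby
from operator import itemgetter
import string

drop = string.punctuation + string.digits
iplist = [('a', 1), ('A', 2), ('a:', 3), ('b', 3), ('B', 7)]   # hypothetical toy input

cleaned = sorted((entry[0].lower().strip(drop), entry[1]) for entry in iplist)
# group on the cleaned word only, then sum the counts inside each group:
summed = [(word, sum(count for _, count in group))
          for word, group in groupby(cleaned, key=itemgetter(0))]
# [('a', 6), ('b', 10)]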

Jon*_*nts 7

Given a sorted list of words, have you tried the traditional Pythonic approach, itertools.groupby?

from itertools import groupby
some_data = ['a', 'a', 'b', 'c', 'c', 'c']
count = dict( (k, sum(1 for i in v)) for k, v in groupby(some_data) ) # or
count = {k:sum(1 for i in v) for k, v in groupby(some_data)}
# {'a': 2, 'c': 3, 'b': 1}


jfs*_*jfs 7

To answer the question from the title: Counter, dict, defaultdict, and OrderedDict are hash-based types: to look up an item they compute the hash of the key and use it to get the item. They even support keys that have no defined order as long as they are hashable, i.e., Counter can't take advantage of pre-sorted input.
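
A minimal illustration: the hash lookups are the same no matter the input order, so the result is identical and no work is saved:

from collections import Counter

words = ['b', 'a', 'c', 'a', 'b', 'a']
assert Counter(words) == Counter(sorted(words))  # same lookups either way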

Measurements show that the sorting of the input words takes longer than counting the words with a dictionary-based approach and sorting the result combined:

sorted                  3.19
count_words_Counter     2.88
count_words_defaultdict 2.45
count_words_dict        2.58
count_words_groupby     3.44
count_words_groupby_sum 3.52

Also, counting the words in already sorted input with groupby() takes only a fraction of the time it takes to sort the input in the first place, and it is faster than the dict-based approaches.

def count_words_Counter(words):
    return sorted(Counter(words).items())

def count_words_groupby(words):
    return [(w, len(list(gr))) for w, gr in groupby(sorted(words))]

def count_words_groupby_sum(words):
    return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted(words))]

def count_words_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return sorted(d.items())

def count_words_dict(words):
    d = {}
    for w in words:
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1
    return sorted(d.items())

def _count_words_freqdist(words):
    # note: .items() returns words sorted by word frequency (decreasing order)
    #       (same as `Counter.most_common()`)
    #       so the code sorts twice (the second time in lexicographical order)
    return sorted(nltk.FreqDist(words).items())

To reproduce the results, run this code.
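
A minimal sketch of such a harness, assuming the count_words_*() functions above are in scope and nltk is installed:

import timeit
import nltk

WORDS = list(nltk.corpus.gutenberg.words())

def bench(func, words=WORDS, repeat=3):
    # report the best of `repeat` wall-clock runs for one counting function
    timer = timeit.Timer(lambda: func(words))
    print '%-24s %.2f' % (func.__name__, min(timer.repeat(repeat, number=1)))

for f in (count_words_Counter, count_words_defaultdict, count_words_dict,
          count_words_groupby, count_words_groupby_sum):
    bench(f)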

Note: it is 3 times faster if nltk's lazy sequence of words is converted to a list (WORDS = list(nltk.corpus.gutenberg.words())), but the relative performance stays the same:

sorted                  1.22
count_words_Counter     0.86
count_words_defaultdict 0.48
count_words_dict        0.54
count_words_groupby     1.49
count_words_groupby_sum 1.55

The results are similar to those in Python - Is a dictionary slow to find the frequency of each character?.

If you want to normalize the words (remove punctuation, make them lowercase, etc.), see the answers to What is the most efficient way in Python to convert a string to all lowercase, stripping out all non-ascii alpha characters?. Some examples:

from string import ascii_letters, ascii_lowercase, maketrans

def toascii_letter_lower_genexpr(s, _letter_set=ascii_lowercase):
    """
    >>> toascii_letter_lower_genexpr("ABC,-.!def")
    'abcdef'
    """
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_genexpr_set(s, _letter_set=set(ascii_lowercase)):
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_translate(s,
    table=maketrans(ascii_letters, ascii_lowercase * 2),
    deletechars=''.join(set(maketrans('', '')) - set(ascii_letters))):
    return s.translate(table, deletechars)

def toascii_letter_lower_filter(s, _letter_set=set(ascii_letters)):
    return filter(_letter_set.__contains__, s).lower()
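
The translate() version wins by such a margin because the lowering and deleting happen largely in a single C-level call per string, while the generator-expression variants execute Python-level bytecode for every character.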

Counting and normalizing the words at the same time:

from collections import defaultdict
from itertools import imap

def combine_counts(items):
    d = defaultdict(int)
    for word, count in items:
        d[word] += count
    return d.iteritems()

def clean_words_in_items(clean_word, items):
    return ((clean_word(word), count) for word, count in items)

def normalize_count_words(words):
    """Normalize then count words."""
    return count_words_defaultdict(imap(toascii_letter_lower_translate, words))

def count_normalize_words(words):
    """Count then normalize words."""
    freqs = count_words_defaultdict(words)
    freqs = clean_words_in_items(toascii_letter_lower_translate, freqs)
    return sorted(combine_counts(freqs))
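
A quick usage sketch on a made-up toy list, assuming the definitions above are in scope:

words = ['Foo', 'foo:', 'bar', 'BAR!', 'baz']
print count_normalize_words(words)
# expected: [('bar', 2), ('baz', 1), ('foo', 2)]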

Results

I updated the benchmark to measure various combinations of the count_words*() and toascii*() functions (5x4 pairs are not shown):

toascii_letter_lower_filter      0.954 usec small
toascii_letter_lower_genexpr     2.44 usec small
toascii_letter_lower_genexpr_set 2.19 usec small
toascii_letter_lower_translate   0.633 usec small

toascii_letter_lower_filter      124 usec random 2000
toascii_letter_lower_genexpr     197 usec random 2000
toascii_letter_lower_genexpr_set 121 usec random 2000
toascii_letter_lower_translate   7.73 usec random 2000

sorted                  1.28 sec 
count_words_Counter     941 msec 
count_words_defaultdict 501 msec 
count_words_dict        571 msec 
count_words_groupby     1.56 sec 
count_words_groupby_sum 1.64 sec 

count_normalize_words 622 msec 
normalize_count_words 2.18 sec 

The fastest approaches:

  • normalize words - toascii_letter_lower_translate()

  • count words (pre-sorted input) - groupby()-based approach

  • count words - count_words_defaultdict()

  • it is faster to count words first and then to normalize them - count_normalize_words()

The latest version of the code: count-words-performance.py.