Is there any way to make collections.Counter (Python 2.7) aware that its input list is sorted?


The problem

I've been playing with different ways (in Python 2.7) of extracting a list of (word, frequency) tuples from a corpus, or list of strings, and comparing their efficiency. As far as I can tell, in the normal case with an unsorted list, the Counter method from the collections module is superior to anything I came up with or found elsewhere, but it doesn't seem to gain much from a pre-sorted list, and I've come up with methods that beat it easily in this special case. So, in short, is there any built-in way to inform Counter that a list is already sorted, to speed it up further?

(The next section is on unsorted lists, where Counter works its magic; you may want to skip towards the end, where it loses its charm when dealing with sorted lists.)

Unsorted input lists

One approach that doesn't work

The naive approach would be to use sorted([(word, corpus.count(word)) for word in set(corpus)]), but that one reliably gets you into runtime problems as soon as your corpus is a few thousand entries long - not surprisingly, since you're running through the entire list of n words m times, where m is the number of unique words.
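
To make the quadratic behaviour concrete, here is a minimal sketch on a made-up toy corpus:

# each corpus.count(word) is itself a full O(n) scan of the list,
# so one call per unique word adds up to O(n*m) total work:
corpus_small = ['the', 'cat', 'sat', 'on', 'the', 'mat']
naive = sorted([(word, corpus_small.count(word)) for word in set(corpus_small)])
# [('cat', 1), ('mat', 1), ('on', 1), ('sat', 1), ('the', 2)]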

Sorting the list + local search

So what I tried to do instead, before I found Counter, was to make sure that all searches are strictly local by first sorting the input list (I also had to remove digits and punctuation marks and convert all entries to lowercase to avoid duplicates like 'foo', 'Foo', and 'foo:').

#Natural Language Toolkit, for access to corpus; any other source for a long text will do, though.
import nltk 

# nltk corpora come as a class of their own, as I understand it presenting to the
# outside as a unique list but underlyingly represented as several lists, with no more
# than one ever loaded into memory at any one time, which is good for memory issues 
# but rather not so for speed so let's disable this special feature by converting it
# back into a conventional list:
corpus = list(nltk.corpus.gutenberg.words()) 

import string
drop = string.punctuation+string.digits  

def wordcount5(corpus, Case=False, lower=False, StrippedSorted=False):
    '''function for extracting word frequencies out of a corpus. Returns an alphabetic list
    of tuples consisting of words contained in the corpus with their frequencies.  
    Default is case-insensitive, but if you need separate entries for upper and lower case 
    spellings of the same words, set option Case=True. If your input list is already sorted
    and stripped of punctuation marks/digits and/or all lower case, you can accelerate the 
    operation by a factor of 5 or so by declaring so through the options "StrippedSorted" and "lower".'''
    # you can ignore the following 6 lines for now, they're only relevant with a pre-processed input list
    if lower or Case:
        if StrippedSorted:
            sortedc = corpus 
        else:    
            sortedc = sorted([word.replace('--',' ').strip(drop)
                   for word in sorted(corpus)])
    # here we sort and purge the input list in the default case:
    else:
        sortedc = sorted([word.lower().replace('--',' ').strip(drop)
                          for word in sorted(corpus)])
    # start iterating over the (sorted) input list:
    scindex = 0
    # create a list:
    freqs = []
    # identify the first token:
    currentword = sortedc[0]
    length = len(sortedc)
    while scindex < length:
        wordcount = 0
        # increment a local counter while the tokens == currentword
        while scindex < length and sortedc[scindex] == currentword:
            scindex += 1
            wordcount += 1
        # store the current word and final score when a) a new word appears or
        # b) the end of the list is reached
        freqs.append((currentword, wordcount))
        # if a): update currentword with the current token
        if scindex < length:
            currentword = sortedc[scindex]
    return freqs

Enter collections.Counter

This did much better, but it is still not as fast as using the Counter class from the collections module, which creates a dictionary of {word: frequency of word} entries (we still have to do the same stripping and lowering, but no sorting):

from collections import Counter
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1
# optionally, if we want to have the same output format as before,
# we can do the following which negligibly adds in runtime:
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
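
As an aside, the same count can be had a little more compactly by feeding the normalized tokens straight into the Counter constructor; a sketch, equivalent to the loop above:

from collections import Counter

# Counter accepts any iterable, so the explicit loop can be folded away:
cnt = Counter(token.lower().strip(drop) for token in corpus)
wordfreqs = sorted(cnt.items())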

On the Gutenberg corpus with appr. 2M entries, the Counter method is roughly 30% faster on my machine (5 seconds instead of 7.2), which is mostly explained by the sorting routine, which eats around 2.1 seconds. (If you don't have and don't want to install the nltk package (Natural Language Toolkit), which gives access to this corpus, any other sufficiently long text appropriately split into a list of strings at word level will show you the same.)

Comparing performance

With my idiosyncratic method of timing, using a tautology (if 1:) as a condition to delay execution, this gives us for the Counter method:

import time
>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in [token.lower().strip(drop) for token in corpus if token not in [" ", ""]]:
...         cnt[word] += 1
...     time.time()-start
...     cntgbfreqs = sorted([(word, cnt[word]) for word in cnt])
...     time.time()-start
... 
4.999882936477661
5.191655874252319

(We see that the last step, formatting the results as a list of tuples, takes up less than 5% of the total time.)

Compared with my function:

>>> if 1:
...     start = time.time()
...     gbfreqs = wordcount5(corpus)
...     time.time()-start
... 
7.261770963668823

Sorted input lists - when Counter 'fails'

As you may have noticed, though, my function allows specifying that the input is already sorted, stripped of punctuational garbage, and converted to lowercase. If we have already created such a converted version of the list for some other operations, using it (and declaring so) can very much speed up the operation of my wordcount5:

>>> sorted_corpus = sorted([token.lower().strip(drop) for token in corpus if token not in [" ", ""]])
>>> if 1:
...     start = time.time()
...     strippedgbfreqs2 = wordcount5(sorted_corpus, lower=True, StrippedSorted=True)
...     time.time()-start
... 
0.9050078392028809

Here, we've reduced the runtime by a factor of appr. 8 by not having to sort the corpus and convert the items. Of course, the latter is also true when feeding Counter with this new list, so expectably it's also a bit faster, but it doesn't seem to take advantage of the fact that the list is sorted, and now takes twice as long as my function, where before it was 30% faster:

>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in sorted_corpus:
...         cnt[word] += 1
...     time.time()-start
...     strippedgbfreqs = [(word, cnt[word]) for word in cnt]
...     time.time()-start
... 
1.9455058574676514
2.0096349716186523

Of course, we can use the same logic I used in wordcount5 - incrementing a local counter until we run into a new word and only then storing the last word with the counter's current state, resetting the counter to 0 for the next word - and merely use Counter as storage, but the inherent efficiency of the Counter method then seems lost, and performance stays within the range of my function's for creating a dictionary, with the extra burden of converting to a list of tuples now looking more troublesome than it used to when we were processing the raw corpus:

>>> def countertest():
...     start = time.time()
...     sortedcnt = Counter()
...     c = 0
...     length = len(sorted_corpus)
...     while c < length:
...         wcount = 0
...         word = sorted_corpus[c]
...         while c < length and sorted_corpus[c] == word:
...             wcount+=1
...             c+=1
...         sortedcnt[word] = wcount
...         if c < length:
...             word = sorted_corpus[c]
...     print time.time()-start
...     result = sorted([(word, sortedcnt[word]) for word in sortedcnt])
...     print time.time()-start
...     return result
... 
>>> strippedbgcnt = countertest()
0.920727014542
1.08029007912

(The similarity of the results is not really surprising, since we are in effect disabling Counter's own methods and abusing it as a store for values obtained with the very same methodology as before.)

So, my question: Is there a more idiomatic way to inform Counter that its input list is already sorted and thus make it keep the current key in memory rather than looking it up anew every time it - predictably - encounters the next token of the same word? In other words, is it possible to improve performance on a pre-sorted list further by combining the inherent efficiency of the Counter/dictionary class with the obvious benefits of a sorted list, or am I already scratching at a hard limit with .9 seconds for counting a list of 2M entries?

There probably isn't a lot of room for improvement - I get times of around .55 seconds when doing the simplest thing I can think of that still requires iterating through the same list and checking each individual value, and .25 for set(corpus) without a count, but maybe there's some itertools magic out there that would help to get closer to those figures?
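
Something along these lines, as a hypothetical sketch of that lower bound (the token 'zzz' is arbitrary):

def touch_all(seq):
    # iterate the whole list and compare each value, but do no dictionary work;
    # the comparison just forces one per-item check
    hits = 0
    for item in seq:
        if item == 'zzz':
            hits += 1
    return hits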

(Note: I'm a relative novice at Python and at programming in general, so excuse me if I've missed something obvious.)

Edit, Dec. 1:

Another thing, besides the sorting itself, that makes all of my methods above slow is the conversion of every single one of the 2M strings to lowercase and the stripping of whatever punctuation or digits they may include. I had tried before to shortcut that by counting the unprocessed strings and only then converting the results and removing duplicates while adding up their counts, but I must have done something wrong, for it made things ever so slightly slower. I therefore reverted to the previous versions, converting everything in the raw corpus, and now can't quite reconstruct what I did there.

If I try it now, I do get an improvement from converting the strings last. I'm still doing it by looping over a list (of results). What I did was write a couple of functions that would, between them, convert the keys in the output of J.F. Sebastian's winning default_dict method (of format [("word", int), ("Word", int), ("word2", int), ...]) into lowercase, strip them of punctuation, and collapse the counts for all keys left identical after that operation (code below). The advantage is that we're now handling a list of around 50k entries as opposed to the >2M in the corpus. This way I'm now at 1.25 seconds on my machine for going from the corpus (as a list) to a case-insensitive word count ignoring punctuation marks, down from about 4.5 with the Counter method and string conversion as a first step. But maybe there's a dictionary-based method for what I'm doing in sum_sorted() as well?

Code:

from collections import defaultdict

def striplast(resultlist, lower_or_Case=False):
    """function for string conversion of the output of any of the `count_words*` methods"""
    if lower_or_Case:
        strippedresult = sorted([(entry[0].strip(drop), entry[1]) for entry in resultlist])
    else:
        strippedresult = sorted([(entry[0].lower().strip(drop), entry[1]) for entry in resultlist])
    strippedresult = sum_sorted(strippedresult)
    return strippedresult

def sum_sorted(inputlist):
    """function for collapsing the counts of entries left identical by striplast()"""
    ilindex = 0
    freqs = []
    currentword = inputlist[0][0]
    length = len(inputlist)
    while ilindex < length:
        wordcount = 0
        while ilindex < length and inputlist[ilindex][0] == currentword:
            wordcount += inputlist[ilindex][1]
            ilindex += 1
        if currentword not in ["", " "]:
            freqs.append((currentword, wordcount))
        if ilindex < length and inputlist[ilindex][0] > currentword:
            currentword = inputlist[ilindex][0]
    return freqs

def count_words_defaultdict2(words, loc=False): 
    """modified version of J.F. Sebastian's winning method, added a final step collapsing
    the counts for words identical except for punctuation and digits and case (for the 
    latter, unless you specify that you're interested in a case-sensitive count by setting
    l(ower_)o(r_)c(ase) to True) by means of striplast()."""
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    if loc:
        return striplast(sorted(d.items()), lower_or_Case=True)
    else:
        return striplast(sorted(d.items()))

I have made some first attempts at using groupby to accomplish the job currently done by sum_sorted() and/or striplast(), but I couldn't quite figure out how to trick it into summing [entry[1]] for a list of entries in the results of count_words, sorted by entry[0]. The closest I got was:

# "i(n)p(ut)list", toylist for testing purposes:

list(groupby(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in  iplist])))

# returns:

[(('a', 1), <itertools._grouper object at 0x1031bb290>), (('a', 2), <itertools._grouper object at 0x1031bb250>), (('a', 3), <itertools._grouper object at 0x1031bb210>), (('a', 5), <itertools._grouper object at 0x1031bb2d0>), (('a', 8), <itertools._grouper object at 0x1031bb310>), (('b', 3), <itertools._grouper object at 0x1031bb350>), (('b', 7), <itertools._grouper object at 0x1031bb390>)]

# So what I used instead for striplast() is based on list comprehension:

list(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in  iplist]))

# returns:

[('a', 1), ('a', 2), ('a', 3), ('a', 5), ('a', 8), ('b', 3), ('b', 7)]
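
For what it's worth, one way to make groupby do the summing is to group on the word alone via a key function and sum the counts inside each group; a minimal sketch on a made-up toy list:

from itertools import groupby
from operator import itemgetter
import string

drop = string.punctuation + string.digits
iplist = [('a', 1), ('A', 2), ('a:', 3), ('b', 3), ('B', 7)]   # hypothetical toy input

cleaned = sorted((entry[0].lower().strip(drop), entry[1]) for entry in iplist)
# group on the cleaned word only, then sum the counts inside each group:
summed = [(word, sum(count for _, count in group))
          for word, group in groupby(cleaned, key=itemgetter(0))]
# [('a', 6), ('b', 10)]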

Jon*_*nts 7

Given a sorted list of words, have you tried the traditional Pythonic approach, itertools.groupby?

from itertools import groupby
some_data = ['a', 'a', 'b', 'c', 'c', 'c']
count = dict( (k, sum(1 for i in v)) for k, v in groupby(some_data) ) # or
count = {k:sum(1 for i in v) for k, v in groupby(some_data)}
# {'a': 2, 'c': 3, 'b': 1}


jfs*_*jfs 7

To answer the question from the title: Counter, dict, defaultdict, and OrderedDict are hash-based types: to look up an item they compute the hash of the key and use it to get the item. They even support keys that have no defined order as long as they are hashable, i.e., Counter can't take advantage of pre-sorted input.
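
A minimal illustration: the hash lookups are the same no matter the input order, so the result is identical and no work is saved:

from collections import Counter

words = ['b', 'a', 'c', 'a', 'b', 'a']
assert Counter(words) == Counter(sorted(words))  # same lookups either way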

Measurements show that the sorting of the input words takes longer than counting the words with a dictionary-based approach and sorting the result combined:

sorted                  3.19
count_words_Counter     2.88
count_words_defaultdict 2.45
count_words_dict        2.58
count_words_groupby     3.44
count_words_groupby_sum 3.52

Also, counting the words in already sorted input with groupby() takes only a fraction of the time it takes to sort the input in the first place, and it is faster than the dict-based approaches.

def count_words_Counter(words):
    return sorted(Counter(words).items())

def count_words_groupby(words):
    return [(w, len(list(gr))) for w, gr in groupby(sorted(words))]

def count_words_groupby_sum(words):
    return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted(words))]

def count_words_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return sorted(d.items())

def count_words_dict(words):
    d = {}
    for w in words:
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1
    return sorted(d.items())

def _count_words_freqdist(words):
    # note: .items() returns words sorted by word frequency (decreasing order)
    #       (same as `Counter.most_common()`)
    #       so the code sorts twice (the second time in lexicographical order)
    return sorted(nltk.FreqDist(words).items())

To reproduce the results, run this code.
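
A minimal sketch of such a harness, assuming the count_words_*() functions above are in scope and nltk is installed:

import timeit
import nltk

WORDS = list(nltk.corpus.gutenberg.words())

def bench(func, words=WORDS, repeat=3):
    # report the best of `repeat` wall-clock runs for one counting function
    timer = timeit.Timer(lambda: func(words))
    print '%-24s %.2f' % (func.__name__, min(timer.repeat(repeat, number=1)))

for f in (count_words_Counter, count_words_defaultdict, count_words_dict,
          count_words_groupby, count_words_groupby_sum):
    bench(f)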

Note: it is 3 times faster if nltk's lazy sequence of words is converted to a list (WORDS = list(nltk.corpus.gutenberg.words())), but the relative performance stays the same:

sorted                  1.22
count_words_Counter     0.86
count_words_defaultdict 0.48
count_words_dict        0.54
count_words_groupby     1.49
count_words_groupby_sum 1.55

The results are similar to those in Python - Is a dictionary slow to find the frequency of each character?.

If you want to normalize the words (remove punctuation, make them lowercase, etc.), see the answers to What is the most efficient way in Python to convert a string to all lowercase, stripping out all non-ascii alpha characters?. Some examples:

from string import ascii_letters, ascii_lowercase, maketrans

def toascii_letter_lower_genexpr(s, _letter_set=ascii_lowercase):
    """
    >>> toascii_letter_lower_genexpr("ABC,-.!def")
    'abcdef'
    """
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_genexpr_set(s, _letter_set=set(ascii_lowercase)):
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_translate(s,
    table=maketrans(ascii_letters, ascii_lowercase * 2),
    deletechars=''.join(set(maketrans('', '')) - set(ascii_letters))):
    return s.translate(table, deletechars)

def toascii_letter_lower_filter(s, _letter_set=set(ascii_letters)):
    return filter(_letter_set.__contains__, s).lower()
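
The translate() version wins by such a margin because the lowering and deleting happen largely in a single C-level call per string, while the generator-expression variants execute Python-level bytecode for every character.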

Counting and normalizing the words at the same time:

from collections import defaultdict
from itertools import imap

def combine_counts(items):
    d = defaultdict(int)
    for word, count in items:
        d[word] += count
    return d.iteritems()

def clean_words_in_items(clean_word, items):
    return ((clean_word(word), count) for word, count in items)

def normalize_count_words(words):
    """Normalize then count words."""
    return count_words_defaultdict(imap(toascii_letter_lower_translate, words))

def count_normalize_words(words):
    """Count then normalize words."""
    freqs = count_words_defaultdict(words)
    freqs = clean_words_in_items(toascii_letter_lower_translate, freqs)
    return sorted(combine_counts(freqs))
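
A quick usage sketch on a made-up toy list, assuming the definitions above are in scope:

words = ['Foo', 'foo:', 'bar', 'BAR!', 'baz']
print count_normalize_words(words)
# expected: [('bar', 2), ('baz', 1), ('foo', 2)]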

Results

I updated the benchmark to measure various combinations of the count_words*() and toascii*() functions (5x4 pairs are not shown):

toascii_letter_lower_filter      0.954 usec small
toascii_letter_lower_genexpr     2.44 usec small
toascii_letter_lower_genexpr_set 2.19 usec small
toascii_letter_lower_translate   0.633 usec small

toascii_letter_lower_filter      124 usec random 2000
toascii_letter_lower_genexpr     197 usec random 2000
toascii_letter_lower_genexpr_set 121 usec random 2000
toascii_letter_lower_translate   7.73 usec random 2000

sorted                  1.28 sec 
count_words_Counter     941 msec 
count_words_defaultdict 501 msec 
count_words_dict        571 msec 
count_words_groupby     1.56 sec 
count_words_groupby_sum 1.64 sec 

count_normalize_words 622 msec 
normalize_count_words 2.18 sec 

The fastest approaches:

  • normalize words - toascii_letter_lower_translate()

  • count words (pre-sorted input) - groupby()-based approach

  • count words - count_words_defaultdict()

  • it is faster to count words first and then to normalize them - count_normalize_words()

The latest version of the code: count-words-performance.py.