Finding the least common elements in a list

Question

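I want to generate an ordered list of the least common words in a large body of text, with the least common word appearing first, along with a value indicating how many times it appears in the text.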

I scraped the text from some online journal articles, then simply assigned it to a variable and split it:

article_one = """ large body of text """.split() 
=> ['large', 'body', 'of', 'text']


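A regular expression seems suited to the next steps, but being new to programming I'm not well versed in them. If the best answer involves a regex, could someone point me to a good regex tutorial other than pydoc?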

Answer 1

sot*_*pme 0

There is a ready-made answer from the mothership.

# From the official documentation ->>
>>> from collections import Counter
>>> # Tally occurrences of words in a list
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
...
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
## ^^^^--- from the standard documentation.

>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

>>> # Adapted for the least common words.
>>> # On Python 3, items() replaces the Python 2 iteritems().
>>> import heapq
>>> from operator import itemgetter
>>> def least_common(adict, n=None):
...     if n is None:
...         # No limit given: return every (word, count) pair, rarest first
...         return sorted(adict.items(), key=itemgetter(1), reverse=False)
...     # Otherwise return only the n pairs with the smallest counts
...     return heapq.nsmallest(n, adict.items(), key=itemgetter(1))


Obviously, adapt to suit :D
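To show how the pieces fit together, here is a minimal sketch that tokenizes a string with a regex, counts the words, and asks for the rarest ones. The `sample_text` string is a hypothetical stand-in for a scraped article, and `least_common` is written for Python 3 (using `items()`); treat it as one possible adaptation rather than the definitive one.

```python
import heapq
import re
from collections import Counter
from operator import itemgetter


def least_common(counts, n=None):
    """Return (word, count) pairs ordered from rarest to most frequent."""
    if n is None:
        # No limit: sort every pair by its count, ascending.
        return sorted(counts.items(), key=itemgetter(1))
    # Only the n pairs with the smallest counts.
    return heapq.nsmallest(n, counts.items(), key=itemgetter(1))


# Hypothetical sample text standing in for a scraped article.
sample_text = "The cat and the dog and the bird saw the cat."

# \w+ matches runs of letters, digits and underscores, so punctuation
# is dropped; lower() makes "The" and "the" count as the same word.
words = re.findall(r"\w+", sample_text.lower())
counts = Counter(words)

rarest_three = least_common(counts, 3)
print(rarest_three)          # the three words that occur exactly once
print(least_common(counts))  # every word, rarest first
```

Using `re.findall` instead of a plain `split()` keeps trailing punctuation such as `cat.` from being counted as a separate word.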

@sotapme Shorter version: `collections.Counter(article_one.split())` (3 upvotes)
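A sketch of that shorter route, assuming `article_one` holds the raw string (in the question it is already a list after `.split()`, in which case it could be passed to `Counter` directly). The sample string and the value of `n` are made up for illustration; the reversed slice of `most_common()` is the idiom the `Counter` documentation gives for the n least common elements.

```python
from collections import Counter

# Assumption: article_one is the raw, unsplit text.
article_one = "large body of text with a large body"

counts = Counter(article_one.split())

n = 2
# most_common() with no argument lists every pair, most frequent first;
# slicing it backwards yields the n least common pairs.
rarest = counts.most_common()[:-n - 1:-1]
print(rarest)
```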

Archived:	12 years, 9 months ago
Views:	4943
Last activity:	7 years, 9 months ago