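I want to generate an ordered list of the least common words in a large body of text, with the least common word appearing first, along with a value indicating how many times it appears in the text.
I scraped the text from some online journal articles, then simply assigned and split it: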
article_one = """ large body of text """.split()
=> ['large', 'body', 'of', 'text']
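A regex seems appropriate for the next steps, but being new to programming I'm not well versed in them. If the best answer includes a regex, could someone point me to a good regex tutorial other than pydoc?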
The mothership already has a ready-made answer.
# From the official documentation ->>
>>> from collections import Counter
>>> # Tally occurrences of words in a list
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
...
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
## ^^^^--- from the standard documentation.
>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
>>> # Least common first: sort ascending by count, or take only the n smallest
>>> import heapq
>>> from operator import itemgetter
>>> def least_common(adict, n=None):
...     # Python 3: items() replaces the Python 2 iteritems()
...     if n is None:
...         return sorted(adict.items(), key=itemgetter(1), reverse=False)
...     return heapq.nsmallest(n, adict.items(), key=itemgetter(1))
Obviously, adapt to suit :D
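For what it's worth, here is one way the pieces might fit together for the text in the question. This is only a sketch under a few assumptions: Python 3, the `\w+` tokenization borrowed from the Hamlet example, and a made-up placeholder string standing in for `article_one`. The name `least_common_words` is hypothetical, not something from the standard library.

```python
import heapq
import re
from collections import Counter
from operator import itemgetter


def least_common_words(text, n=None):
    """Return (word, count) pairs from text, rarest words first."""
    # Lowercase the text and extract runs of word characters,
    # the same tokenization used in the Hamlet example.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    if n is None:
        # Full list, sorted by ascending count.
        return sorted(counts.items(), key=itemgetter(1))
    # Only the n rarest words, without sorting everything.
    return heapq.nsmallest(n, counts.items(), key=itemgetter(1))


# Placeholder text (an assumption); substitute the scraped article.
article_one = "the cat and the dog and the bird"

for word, count in least_common_words(article_one):
    print(word, count)
```

Passing `n` (for example `least_common_words(article_one, 10)`) limits the result to the `n` rarest words. Words with equal counts come back in no particularly meaningful order, so add a secondary sort key if you need alphabetical ties.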