Python: How do I count the X most frequent words in an NLTK corpus?

Wol*_*lff 6 python nltk

I'm not sure whether I've correctly understood how the FreqDist function works in Python. Following a tutorial, I was led to believe that the code below builds a frequency distribution for a given list of words and computes the x most common words. (In the example below, corpus is an NLTK corpus and file.txt is the filename of a file within that corpus.)

words = corpus.words('file.txt')
fd_words = nltk.FreqDist(word.lower() for word in words)
fd_words.items()[:x]

However, when I run the following commands in Python, the output seems to suggest otherwise:

>>> from nltk import *
>>> fdist = FreqDist(['hi','my','name','is','my','name'])
>>> fdist
FreqDist({'my': 2, 'name': 2, 'is': 1, 'hi': 1})
>>> fdist.items()
[('is',1),('hi',1),('my',2),('name',2)]
>>> fdist.items()[:2]
[('is',1),('hi',1)]

So does fdist.items()[:x] actually return the x least common words?

Can someone tell me whether I'm doing something wrong, or whether the mistake lies in the tutorial I'm following?

Jer*_*ski 15

By default, a FreqDist is not sorted. I think you are looking for the most_common method:

from nltk import FreqDist
fdist = FreqDist(['hi','my','name','is','my','name'])
fdist.most_common(2)  # the 2 most frequent (word, count) pairs

which returns:

[('my', 2), ('name', 2)]
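
Applied to the corpus example from the question, the same method works. A minimal sketch, assuming an already-downloaded NLTK corpus (gutenberg and austen-emma.txt here are only stand-ins for the question's corpus and file.txt):

from nltk import FreqDist
from nltk.corpus import gutenberg  # stand-in corpus; requires nltk.download('gutenberg')

x = 10  # stand-in for the question's x
fd_words = FreqDist(word.lower() for word in gutenberg.words('austen-emma.txt'))
fd_words.most_common(x)  # the x most frequent (word, count) pairs, highest counts first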

  • `Counter(['hi','my','name','is','my','name']).most_common()` would also do this ;P (see the sketch below). See: http://stackoverflow.com/questions/34603922/difference-between-pythons-collections-counter-and-nltk-probability-freqdist/34606637#34606637 (3 upvotes)
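
For comparison, a minimal sketch of the collections.Counter equivalent mentioned in the comment above (Counter, like FreqDist, provides most_common):

from collections import Counter

counter = Counter(['hi', 'my', 'name', 'is', 'my', 'name'])
counter.most_common()   # all (word, count) pairs, most frequent first
counter.most_common(2)  # [('my', 2), ('name', 2)]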