use*_*950 3 python nltk word-frequency
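I am using NLTK and trying to count the word phrases up to a certain length in a particular document, along with the frequency of each phrase. I tokenize the string to get the data list.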
from nltk.util import ngrams
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import *

data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a", "test", "this", "is", "this", "is", "real", "not", "a", "test"]

bigrams = ngrams(data, 2)

# Count how often each bigram occurs
bigrams_c = {}
for b in bigrams:
    if b not in bigrams_c:
        bigrams_c[b] = 1
    else:
        bigrams_c[b] += 1
The code above gives output like this:
(('is', 'this'), 1)
(('test', 'this'), 2)
(('a', 'test'), 3)
(('this', 'is'), 4)
(('is', 'not'), 1)
(('real', 'not'), 2)
(('is', 'real'), 2)
(('not', 'a'), 3)
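This is part of what I am looking for.
My question is: is there a more convenient way to do this for phrases of, say, length 4 or 5, without duplicating this code just to change the count variable?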
ale*_*xis 12
Since you tagged this nltk, here is how to do it with nltk's own tools, which have some more features than the standard Python collections.
from nltk import ngrams, FreqDist

# One frequency distribution per n-gram size
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = FreqDist(ngrams(data, size))
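Each element of the dictionary all_counts is itself a dictionary of ngram frequencies. For example, you can get the five most common trigrams like this: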
all_counts[3].most_common(5)
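For comparison, the same idea can be sketched with only the standard library, in case nltk is not available. This is a rough equivalent rather than nltk's implementation: the helper name count_ngrams and its sizes parameter are made up for illustration, and collections.Counter stands in for FreqDist.

```python
from collections import Counter

def count_ngrams(tokens, sizes=(2, 3, 4, 5)):
    """Return a dict mapping each n-gram size to a Counter of n-gram tuples.

    Hypothetical helper for illustration; not part of nltk.
    """
    counts = {}
    for n in sizes:
        # Zipping n staggered slices of the token list yields every
        # run of n consecutive tokens as a tuple.
        grams = zip(*(tokens[i:] for i in range(n)))
        counts[n] = Counter(grams)
    return counts

data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]

all_counts = count_ngrams(data)

# Same call shape as in the nltk version above
print(all_counts[3].most_common(5))
```

Because Counter also provides most_common, the lookup code stays the same; what you give up are the extra FreqDist conveniences that nltk adds on top.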