How to find common phrases in a text document

Question

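I have a text file containing a large number of comments/sentences, and I want to find the most common phrases that repeat within the document itself. I tried playing around with NLTK a bit and found this thread: How to extract common/significant phrases from a series of text entries

However, after trying it, I get strange results like the following: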

>>> finder.apply_freq_filter(3)
>>> finder.nbest(bigram_measures.pmi, 10)
[('m', 'e'), ('t', 's')]

In another file, where the phrase "this is interesting" is very common, I get an empty list [].

How should I go about this?

Here is my full code:

import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words('MkXVM6ad9nI.txt')

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 10 n-grams with the highest PMI
print finder.nbest(bigram_measures.pmi, 10)

Answer 1

Vee*_*rac 6

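I haven't used nltk, but I suspect the problem is that from_words expects a sequence of word tokens rather than a filename. Given a plain string, it would iterate over the individual characters, which would explain letter pairs like ('m', 'e').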

Something like

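import nltk
from nltk.collocations import BigramCollocationFinder

# Read the whole file into a single string
with open('MkXVM6ad9nI.txt') as wordfile:
    text = wordfile.read()

# Split into word/punctuation tokens, then build the finder from the tokens
tokens = nltk.wordpunct_tokenize(text)
finder = BigramCollocationFinder.from_words(tokens)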

might work, although there may also be a dedicated file API.
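To illustrate the suspected cause without NLTK, here is a minimal standard-library sketch. It is not NLTK's implementation: the helper `bigram_counts` and the sample `text` are made up for this example, and the regex tokenizer is only a rough stand-in for a real one. It shows that pairing adjacent items of a raw string produces letter pairs, while pairing adjacent items of a token list produces word pairs.

```python
import re
from collections import Counter


def bigram_counts(items):
    """Count adjacent pairs in any iterable (hypothetical helper, not an NLTK API)."""
    items = list(items)
    return Counter(zip(items, items[1:]))


# Made-up sample text for illustration only
text = "this is interesting. this is interesting! yes, this is interesting"

# Iterating over a str yields single characters, so the pairs are letter pairs
char_pairs = bigram_counts(text)

# Tokenizing first (a crude regex split, assumed good enough here) gives word pairs
tokens = re.findall(r"\w+", text.lower())
word_pairs = bigram_counts(tokens)

print(char_pairs.most_common(3))
print(word_pairs.most_common(3))
```

The same distinction should apply to any function that simply iterates over its input, which is why tokenizing the file contents before building the finder is the likely fix.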
