How to get the most common phrases or words in Python or R

Question

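Given some text, how can I get the most common n-grams for n = 1 to 6? I have seen ways to get 3-grams or 2-grams, one n at a time, but is there a way to extract the longest meaningful phrases together with all the remaining ones?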

For example, take this text, used purely for demonstration: fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.

The ideal result, n-grams with their counts, would be:

fri evening commute: 3,
off-peak: 2,
rest of the words: 1


Any suggestions are appreciated. Thanks.

Answer 1

Nad*_*xan 5

Python

Consider the NLTK library, which provides an ngrams function that you can call while iterating over the values of n.
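
To make "iterating over the values of n" concrete, here is a minimal sketch that counts every n-gram for n = 1 to 6. It assumes whitespace tokenization with periods stripped, and it uses a small standard-library helper, `word_ngrams` (a hypothetical name, not part of NLTK), that produces the same tuples as `nltk.ngrams` does for a token list, so the snippet runs without NLTK installed.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return an iterator of n-token tuples (stand-in for nltk.ngrams on a list)."""
    return zip(*(tokens[i:] for i in range(n)))


text = ('fri evening commute can be long. some people avoid fri evening '
        'commute by choosing off-peak hours. there are much less traffic '
        'during off-peak.')
# Assumption: periods are dropped and hyphenated words stay single tokens
tokens = text.replace('.', '').split()

# One Counter per n, for n = 1..6
counts_by_n = {n: Counter(word_ngrams(tokens, n)) for n in range(1, 7)}

for n, counts in counts_by_n.items():
    gram, freq = counts.most_common(1)[0]
    print(n, ' '.join(gram), freq)
```

This only reports the top n-gram for each n separately; it does not yet decide which length is the "most meaningful" one.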

A rough implementation would look something like the following, with the emphasis on rough:

from collections import Counter

from nltk import ngrams

result = []
sentence = 'fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.'
# Periods are ignored, and hyphenated words are split so that "off-peak" counts as a two-word phrase
sentence = sentence.replace('.', '').replace('-', ' ')

# Try the longest n-grams first, down to bigrams
for n in range(len(sentence.split()), 1, -1):
    phrases = [' '.join(token) for token in ngrams(sentence.split(), n)]
    if not phrases:
        # Earlier removals can leave fewer than n words
        continue

    phrase, freq = Counter(phrases).most_common(1)[0]
    if freq > 1:
        # Store the count (not n) so the printed number is the frequency
        result.append((phrase, freq))
        # Remove the phrase so its sub-phrases are not counted again
        sentence = sentence.replace(phrase, '')

for phrase, freq in result:
    print('%s: %d' % (phrase, freq))

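The loop above removes phrases with a plain string replace, which can glue unrelated words together and only keeps one phrase per n. As a hedged alternative, the sketch below works on token lists, caps n at 6 as the question asks, and counts whatever is left over as single words. The function names `split_out` and `repeated_phrases` are hypothetical, and the choices to treat each period-delimited sentence separately and to keep "off-peak" as one token are assumptions, not part of the original answer.

```python
from collections import Counter


def split_out(tokens, gram):
    """Split a token list around every non-overlapping occurrence of gram."""
    pieces, current, i, n = [], [], 0, len(gram)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == gram:
            pieces.append(current)
            current = []
            i += n
        else:
            current.append(tokens[i])
            i += 1
    pieces.append(current)
    return [p for p in pieces if p]


def repeated_phrases(text, max_n=6):
    """Count repeated phrases, longest first, then the leftover single words."""
    # Assumption: phrases never span a period, so split into sentences first
    segments = [s.split() for s in text.split('.') if s.strip()]
    result = Counter()
    for n in range(max_n, 1, -1):
        while True:
            # Counts are positional, so a self-overlapping repeat
            # (e.g. "a a a" for n=2) would be over-counted
            counts = Counter(
                tuple(seg[i:i + n])
                for seg in segments
                for i in range(len(seg) - n + 1)
            )
            top = counts.most_common(1)
            if not top or top[0][1] < 2:
                break
            gram, freq = top[0]
            result[' '.join(gram)] = freq
            # Cut the phrase out so its sub-phrases are not counted again
            segments = [p for seg in segments for p in split_out(seg, gram)]
    # Whatever remains is counted word by word
    result.update(word for seg in segments for word in seg)
    return result


text = ('fri evening commute can be long. some people avoid fri evening '
        'commute by choosing off-peak hours. there are much less traffic '
        'during off-peak.')
for phrase, freq in repeated_phrases(text).most_common():
    print('%s: %d' % (phrase, freq))
```

On the sample text this groups "fri evening commute" as a phrase and leaves "off-peak" as a repeated single token. Whether a repeated phrase is actually meaningful is not something frequency alone can decide.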

As for R

This may help
