Extracting the most common words from a corpus with Python

Question

use*_*220 2 python dictionary frequency word-count

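Maybe this is a silly question, but I'm having trouble extracting the ten most common words from a corpus with Python. This is what I have so far. (By the way, I'm using NLTK to read a corpus with two subcategories, each containing 10 .txt files.)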

import re
import string
from nltk.corpus import stopwords
stoplist = stopwords.words('dutch')

from collections import defaultdict
from operator import itemgetter

def toptenwords(mycorpus):
    words = mycorpus.words()
    no_capitals = set([word.lower() for word in words]) 
    filtered = [word for word in no_capitals if word not in stoplist]
    no_punct = [s.translate(None, string.punctuation) for s in filtered] 
    wordcounter = {}
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = sorted(wordcounter.iteritems(), key = itemgetter, reverse = True)
    return sorting

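If I print the result of this function for my corpus, I get a list of all the words, each followed by a "1". It does count, but every value is one. I know that, for example, the word "baby" occurs five or six times in my corpus, and it still gives "baby: 1", so it doesn't work the way I want.
Can someone help me?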

Answer 1

小智 5

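If you are using NLTK anyway, try FreqDist(samples) to build a frequency distribution from the given samples first. Then call its most_common(n) method to get the n most common words in the samples, sorted by descending frequency. Something like: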

from nltk.probability import FreqDist
# words: your lowercased, stopword-filtered tokens, kept as a list (not a set)
fdist = FreqDist(words)
top_ten = fdist.most_common(10)

Answer 2

Ami*_*aha 5

The Pythonic way:

In [1]: from collections import Counter

In [2]: words = ['hello', 'hell', 'owl', 'hello', 'world', 'war', 'hello', 'war']

In [3]: counter_obj = Counter(words)

In [4]: counter_obj.most_common() #counter_obj.most_common(n=10)
Out[4]: [('hello', 3), ('war', 2), ('hell', 1), ('world', 1), ('owl', 1)]

Answer 3

pca*_*cao 3

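The problem is your use of set.

A set contains no duplicates, so once you build a set from the lowercased words, each word appears in it only once.

Say your words are: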

 ['banana', 'Banana', 'tomato', 'tomato','kiwi']

After lowercasing every word, you have:

 ['banana', 'banana', 'tomato', 'tomato','kiwi']

But then you do:

 set(['banana', 'banana', 'tomato', 'tomato', 'kiwi'])

which returns a set with one copy of each word:

 set(['banana', 'tomato', 'kiwi'])

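From that point on, your counting is based on the no_capitals set, so every word occurs only once. Don't create the set, and your program will probably work fine.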
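For reference, here is a minimal Python 3 sketch of the function in the question with the set removed. Two other details of the original would also cause trouble: s.translate(None, string.punctuation) only works on Python 2 byte strings, and key=itemgetter would need to be itemgetter(1) to sort by count. The name top_words, its parameters, and the sample data are made up for illustration. The sketch takes plain lists so it can run without a corpus, and it strips punctuation before filtering stopwords, which differs slightly from the order in the question.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def top_words(words, stoplist, n=10):
    """Return the n most frequent (word, count) pairs, most frequent first."""
    # A set is fine here: it is only used for membership tests, not counting.
    stop = set(stoplist)
    counts = Counter()
    for word in words:  # iterate over every token, duplicates included
        token = word.lower().translate(_STRIP_PUNCT)
        # Skip tokens that were pure punctuation, and skip stopwords.
        if token and token not in stop:
            counts[token] += 1
    return counts.most_common(n)


# Hypothetical stand-ins for mycorpus.words() and stopwords.words('dutch').
sample_words = ["Baby", "baby", ",", "de", "baby", "huis", "Huis", "De", "boom", "."]
sample_stoplist = ["de", "het", "een"]

print(top_words(sample_words, sample_stoplist, n=2))
# [('baby', 3), ('huis', 2)]
```

With a real NLTK corpus you would presumably call it as top_words(mycorpus.words(), stoplist), using the stoplist from the question.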
