Removing specific words other than stopwords from an NLTK distribution

Question

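I have a simple sentence like the one below. I want to remove prepositions and words such as A and IT from the token list. I looked through the Natural Language Toolkit (NLTK) documentation but couldn't find anything. Can someone tell me how to do this? Here is my code: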

import nltk
from nltk.tokenize import RegexpTokenizer
test = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
test = test.upper()
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
fdist = nltk.FreqDist(tokens)
common = fdist.most_common(100)

Answer 1

b30*_*000 6

Could stopwords be the solution you are looking for?

You can easily filter them out of the tokenized text:

from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

en_stopws = stopwords.words('english')  # this loads the default stopwords list for English
en_stopws.append('spam')  # add any words you don't like to the list

test = "Hello, this is my sentence. It is a very basic sentence with not much information in it but a lot of spam"
test = test.lower()  # I changed it to lower(), since stopwords are all lower case
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
tokens = [token for token in tokens if token not in en_stopws]  # filter stopwords
fdist = FreqDist(tokens)
common = fdist.most_common(100)
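
If the upper-case tokens from the question need to be kept, one option is to compare a lower-cased copy of each token against the stopword list while leaving the token itself untouched. The following is only a minimal sketch under some assumptions: `custom_stopwords` is a small hand-written set standing in for `stopwords.words('english')`, `re.findall` stands in for `RegexpTokenizer`, and `collections.Counter` stands in for `FreqDist`, so that the snippet runs without NLTK or its corpus data.

```python
import re
from collections import Counter

# Hypothetical stand-in for stopwords.words('english') plus any custom words.
custom_stopwords = {"a", "it", "is", "this", "my", "in", "with", "not", "very", "much"}

test = "Hello, this is my sentence. It is a very basic sentence with not much information in it"

# Same \w+ pattern as in the question; tokens are upper-cased as in the original code.
tokens = re.findall(r"\w+", test.upper())

# Compare case-insensitively, but keep the original upper-case tokens.
filtered = [t for t in tokens if t.lower() not in custom_stopwords]

counts = Counter(filtered)  # nltk.FreqDist(filtered) would be used the same way
common = counts.most_common(100)
```

Using a set rather than a list for the stopwords keeps each membership test cheap, which may matter on longer texts.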

I haven't found a good way to remove entries from a FreqDist after it has been built; let me know if you find one.
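
One possibility, assuming NLTK 3, where `FreqDist` is a subclass of `collections.Counter`: entries can be deleted with ordinary dictionary operations once the distribution exists. The sketch below uses a plain `Counter` as a stand-in so that it runs without NLTK, and the `unwanted` list is hypothetical.

```python
from collections import Counter

tokens = ["hello", "sentence", "it", "a", "basic", "sentence", "spam", "it"]
fdist = Counter(tokens)  # stand-in for nltk.FreqDist(tokens)

unwanted = ["a", "it", "spam", "missing"]  # hypothetical words to drop
for word in unwanted:
    if word in fdist:    # skip words that were never counted
        del fdist[word]  # remove the entry and its count

common = fdist.most_common(100)
```

Filtering the tokens before counting, as in the code above, is probably still the simpler route; deleting afterwards is mainly useful when the distribution has already been built.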
