Removing specific words, beyond stopwords, from an NLTK frequency distribution

jas*_*son 0 python list nltk

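I have a simple sentence like the one below. I want to remove prepositions and words such as A and IT from the list. I looked through the Natural Language Toolkit (NLTK) documentation but couldn't find anything. Can someone show me how to do this? Here is my code: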

import nltk
from nltk.tokenize import RegexpTokenizer
test = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
test = test.upper()
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
fdist = nltk.FreqDist(tokens)
common = fdist.most_common(100)

b30*_*000 6

Perhaps stopwords are the solution you are looking for?

You can filter them out of the tokenized text quite easily:

from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

en_stopws = stopwords.words('english')  # loads the default English stopword list (needs a one-time nltk.download('stopwords'))
en_stopws.append('spam')  # add any words you don't like to the list

test = "Hello, this is my sentence. It is a very basic sentence with not much information in it but a lot of spam"
test = test.lower()  # I changed it to lower(), since stopwords are all lower case
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
tokens = [token for token in tokens if token not in en_stopws]  # filter stopwords
fdist = FreqDist(tokens)
common = fdist.most_common(100)

I didn't find a good way to remove entries from a FreqDist after it has been built; if you find one, please let me know.
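
On removing entries from an existing FreqDist: in NLTK 3, FreqDist is a subclass of collections.Counter, so deleting a key with `del` should work. The following is only a minimal sketch under that assumption, not something from the original answer. The `unwanted` set is a hypothetical example list, and the Counter fallback is there only so the snippet still runs where NLTK is not installed.

```python
from collections import Counter

try:
    from nltk.probability import FreqDist  # a Counter subclass in NLTK 3
except ImportError:
    FreqDist = Counter  # fallback so the sketch runs without NLTK

tokens = "hello this is my sentence it is a very basic sentence with a lot of spam".split()
fdist = FreqDist(tokens)

# Hypothetical list of words to drop after the distribution has been built.
unwanted = {"a", "it", "is", "spam", "the"}

for word in unwanted:
    # Counter-style deletion: no KeyError even if the word is absent ("the").
    del fdist[word]

common = fdist.most_common(3)
print(common)
```

Filtering the tokens before building the distribution, as in the answer above, is usually simpler; deleting keys afterwards is mainly useful when the FreqDist already exists and rebuilding it would be expensive.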