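I have a simple sentence like the one below. I want to remove prepositions and words such as A and IT from the token list. I looked through the Natural Language Toolkit (NLTK) documentation but couldn't find anything. Can someone show me how to do this? Here is my code: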
import nltk
from nltk.tokenize import RegexpTokenizer
test = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
test = test.upper()
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
fdist = nltk.FreqDist(tokens)
common = fdist.most_common(100)
Might stopwords be the solution you are looking for?
You can filter them out of the tokenized text quite easily:
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
en_stopws = stopwords.words('english') # loads the default English stopword list; needs a one-time nltk.download('stopwords')
en_stopws.append('spam') # add any words you don't like to the list
test = "Hello, this is my sentence. It is a very basic sentence with not much information in it but a lot of spam"
test = test.lower() # I changed it to lower(), since stopwords are all lower case
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
tokens = [token for token in tokens if token not in en_stopws] # filter stopwords
fdist = FreqDist(tokens)
common = fdist.most_common(100)
I haven't found a good way to remove entries from a FreqDist; let me know if you find something.
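One possible way to remove entries after counting, offered as a sketch and not as part of the original answer: in NLTK 3, FreqDist subclasses collections.Counter, so dictionary-style removal should work on it. The example below uses a plain Counter so it runs without NLTK installed; the token list and the `unwanted` set are made-up stand-ins for the objects in the code above.

```python
from collections import Counter

# Hypothetical tokens and unwanted words, standing in for the NLTK objects above.
tokens = ["hello", "sentence", "is", "a", "basic", "sentence", "it", "spam"]
unwanted = {"is", "a", "it", "spam", "foo"}  # "foo" is deliberately absent from tokens

# A Counter is used here; nltk.FreqDist subclasses Counter, so the same
# calls are assumed to behave the same way on a FreqDist.
fdist = Counter(tokens)

# Remove entries in place. pop() with a default avoids a KeyError
# when a word was never counted.
for word in unwanted:
    fdist.pop(word, None)

common = fdist.most_common(100)
print(common)
```

If only a single known key has to go, `del fdist['spam']` does the same thing, but it raises a KeyError when the key is missing. Filtering the tokens before counting, as in the answer above, is usually simpler when the unwanted words are known in advance.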