How to remove stop words using nltk or Python

Question

I have a dataset from which I would like to remove stop words, using

stopwords.words('english')

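I am struggling with how to use this in my code to simply take these words out. I already have a list of the words from the dataset; the part I am stuck on is comparing against that list and removing the stop words. Any help is appreciated.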

Answer 1

Dar*_*mas 184

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]

For better performance, consider `stops = set(stopwords.words("english"))`. (46 upvotes)
`>>> import nltk`, then `>>> nltk.download()` [source](http://www.nltk.org/data.html) (2 upvotes)
`stopwords.words('english')` is all lowercase, so make sure the words in your list are lowercase too, e.g. `[w.lower() for w in word_list]` (2 upvotes)
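
The two comments above (build a set once, compare in lowercase) can be combined into one small helper. This is only a sketch: `remove_stopwords` and `SAMPLE_STOPS` are hypothetical names, and the short sample list stands in for `stopwords.words('english')` so the snippet runs without downloading the NLTK corpus.

```python
def remove_stopwords(words, stops):
    """Return the words that are not stop words, comparing in lowercase."""
    stop_set = set(stops)  # build the set once; membership tests are O(1) on average
    return [w for w in words if w.lower() not in stop_set]


# Hypothetical stand-in for stopwords.words('english').
SAMPLE_STOPS = ["the", "is", "a", "on", "and"]

word_list = ["The", "cat", "is", "on", "a", "mat"]
filtered_words = remove_stopwords(word_list, SAMPLE_STOPS)
print(filtered_words)  # ['cat', 'mat']
```

With NLTK installed and the corpus downloaded, passing `stopwords.words('english')` as `stops` should work the same way.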

Answer 2

小智 19

You can also do a set difference, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))

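Note: this converts the sentence to a set, which removes all duplicate words, so you will not be able to do frequency counts on the result. (14 upvotes)
Converting to a set can strip useful information from the sentence by dropping important words that occur more than once. (2 upvotes)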
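
To illustrate the caveat in these comments, here is a small sketch contrasting the set difference with a list comprehension that keeps duplicates and order. The token list and `SAMPLE_STOPS` are made-up sample data, not the NLTK list.

```python
from collections import Counter

# Hypothetical sample data; in practice the stop words would come from NLTK.
SAMPLE_STOPS = {"the", "and", "a"}
tokens = ["the", "dog", "and", "the", "other", "dog"]

# Set difference: duplicates and word order are lost.
as_set = set(tokens) - SAMPLE_STOPS

# List comprehension: duplicates and order survive, so counting still works.
kept = [t for t in tokens if t not in SAMPLE_STOPS]
counts = Counter(kept)

print(kept)           # ['dog', 'other', 'dog']
print(counts["dog"])  # 2
```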

Answer 3

das*_*zul 14

I assume you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

from nltk.corpus import stopwords

filtered_word_list = word_list[:] #make a copy of the word_list
for word in word_list: # iterate over word_list
  if word in stopwords.words('english'): 
    filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword

This will be a lot slower than Daren Thomas's list comprehension... (4 upvotes)

Answer 4

sum*_*njr 10

To exclude all kinds of stop words, including the nltk stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if not w in stop_words]

I get `len(get_stop_words('en')) == 174` vs `len(stopwords.words('english')) == 179` (2 upvotes)
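
Since the two sources overlap, extending one list with the other leaves duplicates and keeps lookups linear. One possible variation is to merge them into a set instead. In this sketch `merge_stopword_lists` is a hypothetical helper, and `list_a` and `list_b` are short made-up stand-ins for `get_stop_words('en')` and `stopwords.words('english')`.

```python
def merge_stopword_lists(*lists):
    """Merge any number of stop word lists into one de-duplicated set."""
    merged = set()
    for words in lists:
        merged.update(words)
    return merged


# Hypothetical stand-ins for the two real stop word lists.
list_a = ["a", "the", "and", "of"]
list_b = ["the", "of", "ours", "hers"]

all_stops = merge_stopword_lists(list_a, list_b)
word_list = ["the", "history", "of", "ours"]
output = [w for w in word_list if w not in all_stops]

print(len(all_stops))  # 6
print(output)          # ['history']
```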

Answer 5

use*_*pij 8

There is a very simple, lightweight Python package for this: stop-words.

First install the package with: pip install stop-words

Then you can remove the stop words in one line with a list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

The package is a very lightweight download (unlike nltk), works on both Python 2 and Python 3, and includes stop words for many other languages, such as:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian

Answer 6

jus*_*dev 6

If you want the result directly as a string (rather than a list of filtered words), here is my take on it:

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text =  ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text
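
One caveat of splitting on whitespace is that capitalized tokens such as "The" and tokens with attached punctuation such as "the," will not match the lowercase stop word list. A minimal sketch of one way to handle this, normalizing each token only for the comparison, follows. `strip_stopwords` and `SAMPLE_STOPS` are hypothetical names, and the small set stands in for `set(stopwords.words('english'))`.

```python
import string


def strip_stopwords(text, stops):
    """Drop stop words from a string, ignoring case and surrounding punctuation."""
    kept = []
    for token in text.split():
        # Normalize only for the lookup; the original token is what gets kept.
        key = token.strip(string.punctuation).lower()
        if key not in stops:
            kept.append(token)
    return " ".join(kept)


# Hypothetical stand-in for set(stopwords.words('english')).
SAMPLE_STOPS = {"the", "is", "on"}

print(strip_stopwords("The cat is on the mat.", SAMPLE_STOPS))  # cat mat.
```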

Answer 7

Yug*_*yal 5

Use the textcleaner library to remove stop words from your data.

See this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow the steps below to do this with the library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the code above to remove the stop words.

Archived: 14 years, 11 months ago
Viewed: 157314 times
Last active: 6 years, 5 months ago