How do I extend the stopword list from NLTK and remove stopwords using the extended list?

jxn*_*jxn 5 python nlp nltk stop-words

I tried two methods of removing stopwords, and ran into problems with both:

Method 1:

cachedStopWords = stopwords.words("english")  # requires: from nltk.corpus import stopwords
words_to_remove = """with some your just have from it's /via & that they your there this into providing would can't"""
remove = tu.removal_set(words_to_remove, query)
remove2 = tu.removal_set(cachedStopWords, query)

In this case, only the first removal works; remove2 does not.

Method 2:

lines = tu.lines_cleanup([sentence for sentence in sentence_list], remove=remove)
words = '\n'.join(lines).split()
print words # list of words

The output looks like this: ["Hello", "Good", "day"]

Then I tried to remove stopwords from words. Here is my code:

for word in words:
    if word in cachedStopWords:
        continue
    else:
        new_words='\n'.join(word)

print new_words

The output looks like this:

H
e
l
l
o

I can't figure out what is wrong with either of the two methods above. Please advise.
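For reference, the letter-per-line output comes from `'\n'.join(word)`: joining a single string iterates over its characters. A minimal sketch of the fix, collecting the kept words in a list first (a small hand-written list stands in for `stopwords.words("english")` here so the snippet runs without the NLTK corpus):

```python
# Stand-in for cachedStopWords = stopwords.words("english")
cachedStopWords = ["to", "your", "a"]
words = ["Hello", "Good", "day", "to", "your", "friend"]

# Keep non-stopwords in a list, then join the whole list once.
# '\n'.join(word) on a single word joins its *characters*,
# which is what produced the H/e/l/l/o output above.
new_words = [word for word in words if word not in cachedStopWords]
print('\n'.join(new_words))
```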

Aka*_*pal 6

Use this to extend the stopword list:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words))
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
print(len(stop_words))

Output:

179

184
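Once extended, the list is used like any other stopword list: filter your tokens against it. A minimal sketch, using a small hard-coded subset in place of the full NLTK list so it runs without the corpus download:

```python
# Stand-in subset; in practice: stop_words = stopwords.words('english')
stop_words = ['i', 'me', 'to', 'your', 'the']
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

sentence = "re your subject from the edu list"
# Lowercase each token before the membership test, since the
# NLTK list is all lowercase.
words = [w for w in sentence.split() if w.lower() not in stop_words]
print(words)
```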


jks*_*snw -1

You need to tokenise your string:

words = string.split()

This is a simple way to do it, though nltk has other tokenisers.

Then perhaps a list comprehension:

words = [w for w in words if w not in cachedstopwords]

Putting it together, this:

from nltk.corpus import stopwords

stop_words = stopwords.words("english")
sentence = "You'll want to tokenise your string"

words = sentence.split()
print words
words = [w for w in words if w not in stop_words]
print words

prints:

["You'll", 'want', 'to', 'tokenise', 'your', 'string']
["You'll", 'want', 'tokenise', 'string']
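One caveat with `str.split()`: punctuation stays attached to tokens, so `"string."` would survive filtering even if `"string"` were in the list. A simple regex tokeniser (a rough stand-in for `nltk.word_tokenize`, again with a small hard-coded stopword subset) sidesteps this:

```python
import re

# Stand-in subset; in practice: stop_words = stopwords.words('english')
stop_words = ['you', 'to', 'your', 'a', 'the']

sentence = "You'll want to tokenise your string."
# Word characters and apostrophes only, so the trailing '.' is dropped.
tokens = re.findall(r"[\w']+", sentence)
words = [t for t in tokens if t.lower() not in stop_words]
print(words)
```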