Getting rid of stop words and punctuation

Question


I am struggling with NLTK stop words.

Here is some of my code. Can someone tell me what is wrong?

from nltk.corpus import stopwords

def removeStopwords( palabras ):
     return [ word for word in palabras if word not in stopwords.words('spanish') ]

palabras = ''' my text is here '''


Answer 1

JHS*_*ers 25

Your problem is that iterating over a string yields each character, not each word.

For example:

>>> palabras = "Buenos dias"
>>> [c for c in palabras]
['B', 'u', 'e', 'n', 'o', 's', ' ', 'd', 'i', 'a', 's']


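You need to iterate over the words and check each one. Fortunately, Python already provides a split function for this (the str.split method, historically also exposed through the string module). However, you are dealing with natural language, which includes punctuation, so you should consider a more robust approach that uses the re module.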
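
As a rough illustration of the difference, here is a minimal sketch that uses only the standard library. The sample text is made up for the example. Splitting on whitespace leaves punctuation attached to the words, while a \w+ regular expression keeps only runs of word characters.

```python
import re

# Sample text invented for this illustration.
text = "Hola, mundo. Buenos dias!"

# Splitting on whitespace keeps punctuation glued to the words.
split_words = text.split()

# \w+ matches runs of letters, digits and underscores,
# so commas, periods and exclamation marks are dropped.
regex_words = re.findall(r"\w+", text)

print(split_words)  # ['Hola,', 'mundo.', 'Buenos', 'dias!']
print(regex_words)  # ['Hola', 'mundo', 'Buenos', 'dias']
```

A token such as 'mundo.' would never match an entry in a stop-word list, which is why the regular expression is usually the safer choice here.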

Once you have a list of words, you should lowercase them all before comparing, and then compare them the way you have already shown.
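
Putting those steps together, a corrected version of the function from the question might look like the sketch below. The name remove_stopwords, its stop_words parameter, and the tiny hand-picked stop-word set are assumptions made for illustration so that the example runs without downloading the NLTK corpus.

```python
import re

def remove_stopwords(text, stop_words):
    """Return the lowercased words of `text` that are not in `stop_words`."""
    # Lowercase first so that "El" and "el" compare equal,
    # then extract words and drop punctuation in one step.
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in stop_words]

# A tiny hand-picked stop-word set, for illustration only.
# With NLTK installed and the corpus downloaded, one would pass
# set(stopwords.words('spanish')) instead.
sample_stop_words = {"el", "la", "de", "que", "y", "es"}

result = remove_stopwords("El problema es que la casa de Juan es grande.",
                          sample_stop_words)
print(result)  # ['problema', 'casa', 'juan', 'grande']
```

Passing the stop words in as a set keeps each membership check cheap and avoids rebuilding the list for every word.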

Buena suerte.

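Edit 1

OK, try this code; it should work for you. It shows two ways of doing it. They are essentially identical, but the first is clearer while the second is more pythonic.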

import re
from nltk.corpus import stopwords

sentence = 'El problema del matrimonio es que se acaba todas las noches despues de hacer el amor, y hay que volver a reconstruirlo todas las mananas antes del desayuno.'

# We only want to work with lowercase for the comparisons
sentence = sentence.lower()

# Remove punctuation and split into separate words
# (in Python 3, \w already matches Unicode letters for str patterns,
# and re.LOCALE cannot be combined with a str pattern)
words = re.findall(r'\w+', sentence)

# Load the stop-word list once instead of once per word; a set gives fast lookups
spanish_stopwords = set(stopwords.words('spanish'))

# This is the simple way to remove stop words
important_words = []
for word in words:
    if word not in spanish_stopwords:
        important_words.append(word)

print(important_words)

# This is the more pythonic way
# (filter returns an iterator in Python 3, so wrap it in list to print the words)
important_words = list(filter(lambda x: x not in spanish_stopwords, words))

print(important_words)


I hope this helps.
