Removing stop words from a pandas DataFrame

Question

I have the script below, and in the last line I am trying to remove stop words from the strings in a column named 'response'.

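The problem is that instead of "a bit annoyed" becoming "bit annoyed", it drops individual letters too, so "a bit annoyed" turns into something like "bit nnoyed", because 'a' is a stop word.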

Can anyone advise?

   import pandas as pd
   from textblob import TextBlob
   import numpy as np
   import os
   import nltk
   nltk.download('stopwords')
   from nltk.corpus import stopwords
   stop = stopwords.words('english')

   path = 'Desktop/fanbase2.csv'
   df = pd.read_csv(path, delimiter=',', header='infer', encoding = "ISO-8859-1")
   #remove punctuation
   df['response'] = df.response.str.replace(r"[^\w\s]", "", regex=True)
   #make it all lower case
   df['response'] = df.response.apply(lambda x: x.lower())
   #Handle strange character in source
   df['response'] = df.response.str.replace("‰Ûª", "''")

   df['response'] = df['response'].apply(lambda x: [item for item in x if item not in stop])

Answer 1

Vai*_*ali 8

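In the list comprehension (the last line), you check each item against the stop words and keep it if it is not one of them. But you are passing it a string, so the items it loops over are single characters rather than words. You need to split the string for the list comprehension to work.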

df = pd.DataFrame({'response':['This is one type of response!', 'Though i like this one more', 'and yet what is that?']})

df['response'] = df.response.str.replace(r"[^\w\s]", "", regex=True).str.lower()

df['response'] = df['response'].apply(lambda x: [item for item in x.split() if item not in stop])


0    [one, type, response]
1      [though, like, one]
2                    [yet]
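
To see why the original line dropped letters, here is a minimal sketch in plain Python. The short `stop` list is a hand-picked stand-in for NLTK's English list (an assumption made only to keep the example self-contained), and "a bit annoyed" is the sample text from the question.

```python
# Hand-picked stand-in for stopwords.words('english'); NLTK's list is much longer.
stop = ["a", "i", "is", "of", "this"]

text = "a bit annoyed"

# Iterating over a str yields single characters, so each *letter* is tested.
by_char = [item for item in text if item not in stop]

# Splitting first yields whole words, so each *word* is tested.
by_word = [item for item in text.split() if item not in stop]

print(by_char)  # every 'a' and 'i' is gone, including those inside words
print(by_word)  # ['bit', 'annoyed']
```

With the full NLTK list more letters would likely disappear in the character-wise version, since that list also contains some single-letter entries.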

If you want the response back as a string, change the last line to

df['response'] = df['response'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop]))

0    one type response
1      though like one
2                  yet
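
The same idea can be wrapped in a small helper, sketched below under a few assumptions: the name `remove_stop_words` and the short `stop_set` are made up for illustration, and a set is used instead of a list because membership tests on a set are faster. Non-string values (for example missing cells) are passed through unchanged.

```python
def remove_stop_words(text, stop_words):
    """Return text with whitespace-separated tokens found in stop_words removed."""
    if not isinstance(text, str):
        return text  # leave missing values (e.g. NaN or None) untouched
    return " ".join(word for word in text.split() if word not in stop_words)


# Hypothetical short stop list; with NLTK this could be set(stopwords.words('english')).
stop_set = {"this", "is", "of", "i", "more", "and", "what", "that"}

# Already lower-cased and stripped of punctuation, as in the answer above.
responses = [
    "this is one type of response",
    "though i like this one more",
    "and yet what is that",
]

cleaned = [remove_stop_words(r, stop_set) for r in responses]
print(cleaned)  # ['one type response', 'though like one', 'yet']
```

On a DataFrame, the helper could be applied with something like `df['response'].apply(remove_stop_words, args=(stop_set,))`.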

Thanks for your help! :) :) :) (2 upvotes)