kik*_*222 2 python nltk pandas
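I have the script below, and in the last line I am trying to remove stop words from the strings in a column called "response".
The problem is that instead of "a bit annoyed" becoming "bit annoyed", individual letters get dropped as well, so "a bit annoyed" turns into something like "bit nnoyed", because 'a' is a stop word.
Can anyone advise me?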
import pandas as pd
from textblob import TextBlob
import numpy as np
import os
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
path = 'Desktop/fanbase2.csv'
df = pd.read_csv(path, delimiter=',', header='infer', encoding = "ISO-8859-1")
#remove punctuation
df['response'] = df.response.str.replace(r"[^\w\s]", "", regex=True)
#make it all lower case
df['response'] = df.response.apply(lambda x: x.lower())
#Handle strange character in source
df['response'] = df.response.str.replace("‰Ûª", "''")
df['response'] = df['response'].apply(lambda x: [item for item in x if item not in stop])
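In the list comprehension (the last line), you check each item against the stop words and keep it if it is not one of them. But you are passing it a string, so the comprehension iterates over single characters rather than words. You need to split the string for the list comprehension to work.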
df = pd.DataFrame({'response':['This is one type of response!', 'Though i like this one more', 'and yet what is that?']})
df['response'] = df.response.str.replace(r"[^\w\s]", "", regex=True).str.lower()
df['response'] = df['response'].apply(lambda x: [item for item in x.split() if item not in stop])
0 [one, type, response]
1 [though, like, one]
2 [yet]
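To see why the original line drops letters, here is a minimal sketch in plain Python. The `stop` list here is a shortened, hypothetical stand-in for `stopwords.words('english')`, so the exact characters removed by the real list may differ.

```python
# Hypothetical, shortened stop-word list; the real one comes from
# nltk.corpus.stopwords.words('english') and is much longer.
stop = ["a", "is", "this", "of"]

text = "a bit annoyed"

# Iterating over a string yields single characters,
# so every letter "a" is filtered out.
by_char = [item for item in text if item not in stop]

# Splitting first yields whole words, so only the word "a" is filtered out.
by_word = [item for item in text.split() if item not in stop]

print(repr("".join(by_char)))   # ' bit nnoyed'
print(repr(" ".join(by_word)))  # 'bit annoyed'
```

The only difference between the two comprehensions is the call to `split()`, which is the change the answer applies inside the `lambda`.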
If you want the response back as a string, change the last line to
df['response'] = df['response'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop]))
0 one type response
1 though like one
2 yet
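If the cleaning steps are needed in more than one place, they could be wrapped in a small helper. This is only a sketch: `remove_stopwords` and `STOP_WORDS` are hypothetical names, and the set below is a shortened stand-in for the NLTK list. Using a set rather than a list makes each membership test a hash lookup, which can matter on large columns.

```python
import re

# Hypothetical, shortened stop-word set; in the real script this could be
# set(stopwords.words('english')).
STOP_WORDS = {"a", "i", "is", "this", "of", "more", "and", "what", "that"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Strip punctuation, lowercase, and drop stop words from one string."""
    # Same punctuation pattern as in the answer: keep word characters and whitespace.
    cleaned = re.sub(r"[^\w\s]", "", text).lower()
    # Split into words before filtering, then join back into a single string.
    return " ".join(word for word in cleaned.split() if word not in stop_words)

print(remove_stopwords("This is one type of response!"))  # one type response
print(remove_stopwords("Though i like this one more"))    # though like one
```

With the DataFrame from the answer, the helper would presumably be applied as `df['response'].apply(remove_stopwords)`.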