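I have a dataset of user reviews. I have loaded the dataset, and now I want to preprocess the reviews (remove stop words and punctuation, convert to lowercase, remove salutations, etc.) before fitting a classifier, but I am getting an error. Here is my code: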
import pandas as pd
import numpy as np

df = pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json", lines=True)
dataset = df.filter(['overall', 'reviewText'], axis=1)

def cleanText(text):
    """
    removes punctuation, stopwords and returns lowercase text in a list
    of single words
    """
    text = (text.lower() for text in text)

    from bs4 import BeautifulSoup
    text = BeautifulSoup(text).get_text()

    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)

    from nltk.corpus import stopwords
    clean = [word for word in text if word not in
             stopwords.words('english')]

    return clean

dataset['reviewText'] = dataset['reviewText'].apply(cleanText)
dataset['reviewText']
I get these errors:
TypeError Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> …
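
For comparison, here is a minimal sketch of the same cleaning steps applied to one review string at a time. It is not the original code: it uses only the standard library so it runs anywhere, a regex stands in for BeautifulSoup's tag stripping, and `STOP_WORDS` and `SALUTATIONS` are small hypothetical sets standing in for NLTK's `stopwords.words('english')` and whatever salutation list you prefer. One thing it does differently, and which is a likely source of the TypeError, is that it calls `text.lower()` on the whole string instead of building a generator with `(text.lower() for text in text)`; a generator is not a string, so a parser that expects markup text cannot handle it.

```python
import html
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# NLTK's stopwords.words('english') is much larger.
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "is", "it", "this", "to", "of", "for", "i", "my",
})

# Hypothetical salutation list; extend as needed for your data.
SALUTATIONS = frozenset({"mr", "mrs", "ms", "dr"})

TAG_RE = re.compile(r"<[^>]+>")   # crude stand-in for BeautifulSoup(...).get_text()
WORD_RE = re.compile(r"\w+")      # same pattern as RegexpTokenizer(r'\w+')


def clean_text(text):
    """Return lowercase word tokens with tags, punctuation, stop words
    and salutations removed."""
    # Drop HTML tags, then decode entities such as &amp;
    text = html.unescape(TAG_RE.sub(" ", text))
    # Lowercase the whole string once; do not iterate over its characters
    tokens = WORD_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in SALUTATIONS]


print(clean_text("This is a <b>GREAT</b> guitar, Mr. Smith!"))
```

With pandas, `dataset['reviewText'].apply(clean_text)` passes each review to the function as a single string, so no per-character loop is needed. If you keep NLTK, building the stop-word set once outside the function (for example `set(stopwords.words('english'))`) avoids reloading the list for every word.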