Posts by Shu*_*ngh

Preprocessing string data in a pandas DataFrame

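I have a dataset of user reviews. I have loaded it, and now I want to preprocess the reviews (remove stopwords and punctuation, convert to lowercase, remove salutations, etc.) before fitting them to a classifier, but I am getting an error. Here is my code: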

    import pandas as pd
    import numpy as np
    df=pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json",lines=True)
    dataset=df.filter(['overall','reviewText'],axis=1)
    def cleanText(text):
        """
        removes punctuation, stopwords and returns lowercase text in a list 
        of single words
        """
        text = (text.lower() for text in text)   

        from bs4 import BeautifulSoup
        text = BeautifulSoup(text).get_text()

        from nltk.tokenize import RegexpTokenizer
        tokenizer = RegexpTokenizer(r'\w+')
        text = tokenizer.tokenize(text)

        from nltk.corpus import stopwords
        clean = [word for word in text if word not in 
        stopwords.words('english')]

        return clean

    dataset['reviewText']=dataset['reviewText'].apply(cleanText)
    dataset['reviewText']

I get these errors:

TypeError                                 Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> …
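The traceback above is cut off, so the exact cause cannot be confirmed from the post. One likely culprit is the line `text = (text.lower() for text in text)`: `Series.apply` passes each review to `cleanText` as a single string, so that expression yields a generator over individual characters rather than a lowercase string, and the later steps then receive something that is not text. Below is a minimal sketch of the same cleaning steps applied to one string at a time. To keep it runnable without extra downloads it makes some assumptions: `STOPWORDS` is a tiny hand-written stand-in for `stopwords.words('english')`, the regex tag stripper is a crude substitute for `BeautifulSoup(...).get_text()`, and `clean_text` and the sample reviews are made up for illustration.

```python
import html
import re

# Assumption: a tiny stand-in for nltk's stopwords.words('english'),
# so the sketch runs without downloading the NLTK corpus.
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "it", "in", "this", "to", "of"})

TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper (stand-in for BeautifulSoup)
WORD_RE = re.compile(r"\w+")     # same pattern the question gives to RegexpTokenizer


def clean_text(text):
    """Return the lowercase, non-stopword word tokens of ONE review string."""
    text = text.lower()  # lowercase the whole string, not character by character
    text = html.unescape(TAG_RE.sub(" ", text))  # drop tags, decode HTML entities
    tokens = WORD_RE.findall(text)  # keeps runs of word characters, drops punctuation
    # Membership tests against a prebuilt set avoid rebuilding the stopword list per word.
    return [word for word in tokens if word not in STOPWORDS]


# Hypothetical sample reviews, standing in for dataset['reviewText'].
reviews = ["This is a <b>GREAT</b> guitar strap!", "Works well and stays in tune."]
cleaned = [clean_text(review) for review in reviews]
print(cleaned)
```

With pandas, the same function could be passed unchanged as `dataset['reviewText'].apply(clean_text)`, since `apply` calls it once per row with that row's string. If the real NLTK list is used, building `set(stopwords.words('english'))` once outside the function should give the same filtering with far fewer repeated lookups.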

machine-learning nltk python-3.x pandas data-cleaning

2 votes · 1 answer · 7090 views