Posts by Shu*_*ngh

Preprocessing string data in a pandas DataFrame

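I have a dataset of user reviews. I have loaded it, and now I want to preprocess the reviews (i.e. remove stopwords and punctuation, convert to lowercase, remove salutations, etc.) before fitting them to a classifier, but I am getting an error. Here is my code: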

    import pandas as pd
    import numpy as np
    df=pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json",lines=True)
    dataset=df.filter(['overall','reviewText'],axis=1)
    def cleanText(text):
        """
        removes punctuation, stopwords and returns lowercase text in a list 
        of single words
        """
        text = (text.lower() for text in text)   

        from bs4 import BeautifulSoup
        text = BeautifulSoup(text).get_text()

        from nltk.tokenize import RegexpTokenizer
        tokenizer = RegexpTokenizer(r'\w+')
        text = tokenizer.tokenize(text)

        from nltk.corpus import stopwords
        clean = [word for word in text if word not in 
        stopwords.words('english')]

        return clean

    dataset['reviewText']=dataset['reviewText'].apply(cleanText)
    dataset['reviewText']


I get these errors:

TypeError                                 Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> …
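
A likely cause, judging only from the code above (the traceback is cut off, so this is a guess): `(text.lower() for text in text)` does not lowercase the review. It builds a generator over the individual characters, and `BeautifulSoup` and the tokenizer then receive a generator instead of a `str`. Calling `text.lower()` on the whole string avoids that. Below is a minimal stdlib-only sketch of the idea. `clean_text` and `STOPWORDS` are hypothetical names, the tiny stopword set is only a stand-in for `stopwords.words('english')`, `re.findall(r"\w+", ...)` mirrors `RegexpTokenizer(r'\w+')`, and the HTML-stripping step is left out here.

```python
import re

# Assumption: a tiny hand-written stand-in for nltk's stopwords.words('english').
# Built once as a set so each membership test is cheap.
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "it", "and", "i", "this", "to", "of", "in"}
)


def clean_text(text):
    """Lowercase a review, split it into word tokens, and drop stopwords."""
    text = text.lower()                # lowercase the whole string at once
    tokens = re.findall(r"\w+", text)  # word-character runs; punctuation is dropped
    return [token for token in tokens if token not in STOPWORDS]


# What the original line builds: a generator over characters, not a string.
broken = (ch.lower() for ch in "Great Guitar!")
print(type(broken).__name__)

print(clean_text("This is a GREAT guitar, and it stays in tune!"))
```

In the original function the same change would be `text = text.lower()`, with the BeautifulSoup, tokenizer, and stopword steps kept as they are. Building the stopword collection once outside the list comprehension, rather than calling `stopwords.words('english')` for every word, should also make `apply` noticeably faster on a large column.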


machine-learning nltk python-3.x pandas data-cleaning

Shu*_*ngh

2017-12-23

2 votes

1 answer

7090 views
