Python: remove stop words from a pandas DataFrame

I a*_*rge 28 python pandas

I want to remove the stop words from my "tweet" column. How do I iterate over each row and each item?

import pandas as pd

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]
test["tweet"] = test["tweet"].str.lower().str.split()

from nltk.corpus import stopwords
stop = stopwords.words('english')

Kei*_*iku 30

We can import stopwords from nltk.corpus as shown below. With that, we exclude the stop words using a Python list comprehension and pandas.Series.apply.

# Import pandas, and the stopwords corpus from nltk.
import pandas as pd
from nltk.corpus import stopwords
stop = stopwords.words('english')

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]

# Exclude stopwords with a list comprehension and pandas.Series.apply.
# The comparison is case-sensitive, so capitalized words such as "I" are kept.
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
print(test)
# Out[40]:
#                                tweet     class tweet_without_stopwords
# 0                    I love this car  positive              I love car
# 1               This view is amazing  positive       This view amazing
# 2          I feel great this morning  positive    I feel great morning
# 3  I am so excited about the concert  positive       I excited concert
# 4               He is my best friend  positive          He best friend
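The output above still contains "I", "This" and "He" because the NLTK list is lowercase and the membership test is case-sensitive. If those words should go as well, one option is to compare the lowercased form of each word. The following is a minimal sketch, not part of the original answer: the small `stop` set is a stand-in so the example runs without the NLTK data, and `remove_stopwords` is a hypothetical helper name.

```python
import pandas as pd

# Stand-in stop list for illustration only; in practice this would be
# set(stopwords.words('english')) from nltk.corpus.
stop = {"i", "this", "is", "am", "so", "about", "the", "he", "my"}

test = pd.DataFrame(
    [("I love this car", "positive"),
     ("This view is amazing", "positive"),
     ("He is my best friend", "positive")],
    columns=["tweet", "class"],
)

def remove_stopwords(text):
    # Test the lowercased word, but keep the original spelling of kept words.
    return " ".join(word for word in text.split() if word.lower() not in stop)

test["tweet_without_stopwords"] = test["tweet"].apply(remove_stopwords)
print(test["tweet_without_stopwords"].tolist())
```

Using a set rather than a list also makes each membership test constant-time, which can matter on large columns.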

The stop words can also be excluded using pandas.Series.str.replace.

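pat = r'\b(?:{})\b'.format('|'.join(stop))
# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching.
test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '', regex=True)
test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ', regex=True)
# Same results.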
# 0              I love car
# 1       This view amazing
# 2    I feel great morning
# 3       I excited concert
# 4          He best friend
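To make the pattern in this approach easier to read, here is a small sketch using only the standard `re` module. The three-word `stop` list and the sample sentence are made up for illustration, and the `re.escape` call is an added precaution that is not in the original answer.

```python
import re

# Stand-in stop list for illustration; the answer uses stopwords.words('english').
stop = ["this", "is", "the"]

# '|'.join(...) builds the alternation "this|is|the".
# (?:...) groups the alternatives without creating a capture group.
# \b on each side matches only at a word boundary, so "is" inside "this"
# or "the" inside "theory" is left alone.
pat = r'\b(?:{})\b'.format('|'.join(re.escape(w) for w in stop))

text = "this theory is the best"
stripped = re.sub(pat, '', text)                 # remove whole-word matches
cleaned = re.sub(r'\s+', ' ', stripped).strip()  # collapse leftover whitespace

print(pat)
print(cleaned)
```

The final `.strip()` removes the leading space that is left behind when the first word of the text is a stop word.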

If you cannot import the stop words, you can download them as follows.

import nltk
nltk.download('stopwords')

Another way is to import text.ENGLISH_STOP_WORDS from sklearn.feature_extraction.

# Import stopwords with scikit-learn
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS

Note that the scikit-learn and nltk stop word lists contain different numbers of words.

  • What does `r'\b(?:{})\b'` do? (4 upvotes)

Lia*_*ley 26

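Using a list comprehension (this assumes the tweet column has already been lowercased and split into word lists, as in the question):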

test['tweet'].apply(lambda x: [item for item in x if item not in stop])

Returns:

0               [love, car]
1           [view, amazing]
2    [feel, great, morning]
3        [excited, concert]
4            [best, friend]

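  • I needed to add `str(x).split()`, so it becomes `test['tweet'].apply(lambda x: [item for item in str(x).split() if item not in stopwords.words('spanish')])`, because I was getting an error saying **'float' object is not iterable** (6 upvotes)
  • This doesn't preserve the string, so once the stop words are removed you can't search for word combinations. Ed Chum's comment above preserves the string. (2 upvotes)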
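The "'float' object is not iterable" error mentioned in the comments usually means the column contains missing values, since NaN is a float. Wrapping with `str(x)` avoids the error but turns NaN into the literal string "nan". The following sketch shows an alternative that passes missing values through unchanged and returns a string rather than a list, which also addresses the second comment. The `stop` set and the helper name `remove_stopwords` are illustrative assumptions.

```python
import pandas as pd

# Stand-in stop list for illustration only.
stop = {"i", "this", "is"}

# A column with a missing value, which pandas represents as a float NaN.
tweets = pd.Series(["I love this car", float("nan"), "This view is amazing"])

def remove_stopwords(text):
    # Leave missing or non-string values untouched instead of iterating over them.
    if not isinstance(text, str):
        return text
    return " ".join(w for w in text.split() if w.lower() not in stop)

cleaned = tweets.apply(remove_stopwords)
print(cleaned)
```

Missing rows stay missing in the result, so they can still be dropped or filled later with the usual pandas tools.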

Cli*_*der 8

If you want something simple but don't want to get a list of words back:

# Lowercase each word before the membership test, since the stop list is lowercase.
test["tweet"].apply(lambda words: ' '.join(word.lower() for word in words.split() if word.lower() not in stop))

Here stop is defined as in the OP's question:

from nltk.corpus import stopwords
stop = stopwords.words('english')


mok*_*ok0 7

Take a look at pd.DataFrame.replace(); it might work for you:

In [42]: test.replace(to_replace='I', value="",regex=True)
Out[42]:
                              tweet     class
0                     love this car  positive
1              This view is amazing  positive
2           feel great this morning  positive
3   am so excited about the concert  positive
4              He is my best friend  positive

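Edit: replace() searches for the string anywhere, even as a substring. For example, it would remove rk from work if rk were a stop word, which is usually not what you want.

Hence the use of a regex with word boundaries here: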

for i in stop:
    test = test.replace(to_replace=r'\b%s\b'%i, value="",regex=True)
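The substring pitfall described in this answer can be reproduced with a tiny DataFrame. This is only a sketch: "rk" is the hypothetical stop word from the example above, and the sample sentence is made up.

```python
import pandas as pd

df = pd.DataFrame({"tweet": ["hard work at rk labs"]})

# Plain pattern: "rk" is also removed from inside "work".
naive = df.replace(to_replace="rk", value="", regex=True)

# Word-boundary pattern: only the standalone token "rk" is removed.
bounded = df.replace(to_replace=r"\brk\b", value="", regex=True)

print(naive.loc[0, "tweet"])
print(bounded.loc[0, "tweet"])
```

Both results keep a double space where the token was removed, so a whitespace cleanup step may still be needed afterwards.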