A more efficient way to find multiple keywords in a pandas string column

Question

lsa*_*ama 4 python performance dataframe pandas

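I have a DataFrame with many rows of strings in btb['Title']. I want to determine whether each string contains a positive, negative, or neutral keyword. The approach below works, but it is quite slow: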

import numpy as np

positive_kw = ('rise', 'positive', 'high', 'surge')
negative_kw = ('sink', 'lower', 'fall', 'drop', 'slip', 'loss', 'losses')
neutral_kw = ('flat', 'neutral')

# Create the new columns, initially empty (NaN)
btb['Positive'] = np.nan
btb['Negative'] = np.nan
btb['Neutral'] = np.nan

# Set the value to 1 if any keyword occurs in the title
for index, row in btb.iterrows():
    if any(s in row.Title for s in positive_kw):
        btb.loc[index, 'Positive'] = 1  # single .loc call avoids chained assignment
    if any(s in row.Title for s in negative_kw):
        btb.loc[index, 'Negative'] = 1
    if any(s in row.Title for s in neutral_kw):
        btb.loc[index, 'Neutral'] = 1

Thanks for your time. I'd like to understand what it would take to improve the performance of this code.

Answer 1

Mat*_*0se 5

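You can use '|'.join on the list of words to build a regex pattern that matches any (at least one) of the words, and then use the pandas.Series.str.contains() method to create a boolean mask of the matches.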

import pandas as pd

# create regex pattern out of the list of words
positive_kw = '|'.join(['rise','positive','high','surge'])
negative_kw = '|'.join(['sink','lower','fall','drop','slip','loss','losses'])
neutral_kw = '|'.join(['flat','neutral'])

# creating some fake data for demonstration
words = [
        'rise high',
        'positive attitude',
        'something',
        'foo',
        'lowercase',
        'flat earth',
        'neutral opinion'
        ]

df = pd.DataFrame(data=words, columns=['words'])

df['positive'] = df['words'].str.contains(positive_kw).astype(int)
df['negative'] = df['words'].str.contains(negative_kw).astype(int)
df['neutral'] = df['words'].str.contains(neutral_kw).astype(int)

print(df)

Output:

               words  positive  negative  neutral
0          rise high         1         0        0
1  positive attitude         1         0        0
2          something         0         0        0
3                foo         0         0        0
4          lowercase         0         1        0
5         flat earth         0         0        1
6    neutral opinion         0         0        1
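
Note that row 4 ('lowercase') is flagged as negative because the pattern matches 'lower' as a substring, which is the same behaviour as the `in` test in the question. If whole-word matching is wanted instead, one possible variation is sketched below. The helper name keyword_pattern and the sample rows are made up for illustration; re.escape, the \b word boundary, and the case and na parameters of str.contains are standard.

```python
import re

import pandas as pd


def keyword_pattern(keywords):
    """Build a regex matching any keyword as a whole word (hypothetical helper)."""
    # re.escape guards against keywords that contain regex metacharacters.
    escaped = (re.escape(kw) for kw in keywords)
    # (?:...) is a non-capturing group wrapped in word boundaries.
    return r"\b(?:" + "|".join(escaped) + r")\b"


negative_kw = ['sink', 'lower', 'fall', 'drop', 'slip', 'loss', 'losses']

# Illustrative data only, including a missing value.
df = pd.DataFrame({'words': ['lowercase', 'Prices fall again', 'lower bound', None]})

# case=False ignores capitalisation; na=False treats missing titles as "no match".
df['negative'] = (
    df['words']
    .str.contains(keyword_pattern(negative_kw), case=False, na=False)
    .astype(int)
)

print(df)
```

With this pattern 'lowercase' is no longer counted, but neither are inflected forms such as 'rises' or 'falling', so whether it is an improvement depends on the data.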
