lsa*_*ama 4 python performance dataframe pandas
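I have a DataFrame with many rows of strings: btb['Title']. I want to determine whether each string contains positive, negative, or neutral keywords. The following approach works, but it is quite slow: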
import numpy as np

positive_kw = ('rise', 'positive', 'high', 'surge')
negative_kw = ('sink', 'lower', 'fall', 'drop', 'slip', 'loss', 'losses')
neutral_kw = ('flat', 'neutral')
# Create the new columns, initialised to NaN
btb['Positive'] = np.nan
btb['Negative'] = np.nan
btb['Neutral'] = np.nan
# Set the value to 1 if any keyword occurs in the title
for index, row in btb.iterrows():
    if any(s in row.Title for s in positive_kw):
        btb.loc[index, 'Positive'] = 1
    if any(s in row.Title for s in negative_kw):
        btb.loc[index, 'Negative'] = 1
    if any(s in row.Title for s in neutral_kw):
        btb.loc[index, 'Neutral'] = 1
Thank you for your time. I would like to understand what is needed to improve the performance of this code.
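You can use '|'.join on the list of words to build a regex pattern that matches any (at least one) of the words, and then use the pandas.Series.str.contains() method to create a boolean mask of the matches.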
import pandas as pd
# create regex pattern out of the list of words
positive_kw = '|'.join(['rise','positive','high','surge'])
negative_kw = '|'.join(['sink','lower','fall','drop','slip','loss','losses'])
neutral_kw = '|'.join(['flat','neutral'])
# creating some fake data for demonstration
words = [
    'rise high',
    'positive attitude',
    'something',
    'foo',
    'lowercase',
    'flat earth',
    'neutral opinion'
]
df = pd.DataFrame(data=words, columns=['words'])
df['positive'] = df['words'].str.contains(positive_kw).astype(int)
df['negative'] = df['words'].str.contains(negative_kw).astype(int)
df['neutral'] = df['words'].str.contains(neutral_kw).astype(int)
print(df)
Output:
               words  positive  negative  neutral
0          rise high         1         0        0
1  positive attitude         1         0        0
2          something         0         0        0
3                foo         0         0        0
4          lowercase         0         1        0
5         flat earth         0         0        1
6    neutral opinion         0         0        1
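Note that 'lowercase' is flagged as negative because str.contains matches substrings, so 'lower' is found inside it. If whole-word matching is wanted instead, the same idea might be sketched as below. This is only an illustration under assumptions: the `keywords` dict layout and the sample `btb` rows are made up for the example, and the `\b` word boundaries, `case=False` and `na=False` are additions that are not part of the code above.

```python
import re

import pandas as pd

# Keyword groups taken from the question; the dict layout is an assumption.
keywords = {
    'Positive': ['rise', 'positive', 'high', 'surge'],
    'Negative': ['sink', 'lower', 'fall', 'drop', 'slip', 'loss', 'losses'],
    'Neutral': ['flat', 'neutral'],
}

# Hypothetical stand-in for the question's btb DataFrame, including a missing title.
btb = pd.DataFrame(
    {'Title': ['Stocks rise', 'Lowercase letters', 'Shares slip lower', None]}
)

for column, words in keywords.items():
    # re.escape protects any regex metacharacters in the keywords;
    # \b on both sides restricts the match to whole words.
    pattern = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'
    # case=False ignores letter case; na=False treats missing titles as "no match".
    btb[column] = btb['Title'].str.contains(pattern, case=False, na=False).astype(int)

print(btb)
```

With word boundaries, 'Lowercase letters' is no longer counted as negative, but inflected forms such as 'rises' stop matching 'rise' as well, so whether this trade-off is appropriate depends on the data.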