Gan*_*ina 4 python dataframe pandas
Suppose I have two DataFrames:
import pandas as pd
sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego', 'Grow Your Beans', 'James Dean and his Little Red coat', 'I love pasta'])
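One contains various subjects, and the other contains the text from which I need to extract those subjects.
I would like the output for the text DataFrame to be: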
Text | Subjects
Little Red Corvette must Grow Your ego | Little Red, Grow Your
Grow Your Beans | Grow Your
James Dean and his Little Red coat | Little Red
I love pasta | NaN
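Any idea how I can do this? I was looking at this question: "Check if words in one dataframe appear in another (python 3, pandas)", but it does not give quite the output I want. Thanks.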
Use str.findall with a pattern built by joining all values of sub with |, each wrapped in regex word boundaries (\b):
pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
text['new'] = text[0].str.findall(pat).str.join(', ')
print (text)
0 new
0 Little Red Corvette must Grow Your ego Little Red, Grow Your
1 Grow Your Beans Grow Your
2 James Dean and his Little Red coat Little Red
3 I love pasta
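The pattern above inserts each subject into the regex verbatim, which works for the sample data. If a subject could contain regex metacharacters (for example '.', '+' or parentheses), a minimal sketch of a safer variant, assuming the subjects should always match literally, is to pass each one through re.escape first:

```python
import re
import pandas as pd

sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego',
                     'Grow Your Beans',
                     'James Dean and his Little Red coat',
                     'I love pasta'])

# re.escape makes every subject match literally, even if it contains
# characters that would otherwise have a special meaning in a regex.
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in sub[0])

# findall returns a list of matches per row; join turns it into one string
# (an empty string when nothing matched).
text['new'] = text[0].str.findall(pat).str.join(', ')
print(text)
```

Note that \b only behaves as expected when a subject starts and ends with a word character; subjects that begin or end with punctuation would need a different boundary check.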
If you want NaN for rows with no match, assign only the matching rows with loc:
pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
lists = text[0].str.findall(pat)
m = lists.astype(bool)
text.loc[m, 'new'] = lists.loc[m].str.join(', ')
print (text)
0 new
0 Little Red Corvette must Grow Your ego Little Red, Grow Your
1 Grow Your Beans Grow Your
2 James Dean and his Little Red coat Little Red
3 I love pasta NaN
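As an alternative to the boolean mask, the joined strings can be computed for every row and the empty ones turned into NaN afterwards. This is only a sketch of the same idea using Series.where, not part of the original answer; the intermediate name joined is chosen here for illustration:

```python
import pandas as pd

sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego',
                     'Grow Your Beans',
                     'James Dean and his Little Red coat',
                     'I love pasta'])

pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])

# Rows without a match produce an empty list, which joins to ''.
joined = text[0].str.findall(pat).str.join(', ')

# where() keeps values where the condition holds and puts NaN elsewhere,
# so rows with no matched subject end up as NaN.
text['new'] = joined.where(joined.ne(''))
print(text)
```

Both approaches give the same column; the loc version avoids building strings for rows that have no match, while this one is a single chained expression.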