Finding phrases stored in one DataFrame within sentences in another DataFrame

Gan*_*ina 4 python dataframe pandas

Suppose I have two DataFrames:

import pandas as pd

sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego', 'Grow Your Beans', 'James Dean and his Little Red coat', 'I love pasta'])

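One contains the subjects, and the other contains the text from which I need to extract those subjects.

The output I want for the text DataFrame is: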

Text                                    | Subjects
Little Red Corvette must Grow Your ego  | Little Red, Grow Your
Grow Your Beans                         | Grow Your
James Dean and his Little Red coat      | Little Red
I love pasta                            | NaN

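Any idea how I can do this? I was looking at this question: Check if words in one dataframe appear in another (python 3, pandas), but it doesn't give quite the output I want. Thanks.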

jez*_*ael 5

Use str.findall with a pattern built by joining all values of sub with |, wrapping each value in regex word boundaries:

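# Build one alternation pattern, wrapping each phrase in word boundaries
pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
# findall returns a list of matches per row; join each list into one string
text['new'] = text[0].str.findall(pat).str.join(', ')
print(text)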
                                        0                    new
0  Little Red Corvette must Grow Your ego  Little Red, Grow Your
1                         Grow Your Beans              Grow Your
2      James Dean and his Little Red coat             Little Red
3                            I love pasta                       
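The pattern above inserts each phrase into the regex as-is, so a phrase containing a regex metacharacter (such as `.`) would be interpreted as a pattern rather than literal text. A minimal sketch of one way to guard against that, assuming the phrases are meant to be matched literally, is to pass each one through `re.escape` first. The sample phrases and sentences here are made up for illustration and are not from the question:

```python
import re
import pandas as pd

# Hypothetical sample data: 'Mr. Big' contains '.', a regex metacharacter
sub = pd.DataFrame(['Mr. Big', 'Little Red'])
text = pd.DataFrame(['Mr. Big met Little Red', 'Mrs Big arrived', 'I love pasta'])

# Escape each phrase so it is matched literally, then add word boundaries
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in sub[0])

# Without re.escape, 'Mr. Big' would also match 'Mrs Big' in the second row
text['new'] = text[0].str.findall(pat).str.join(', ')
print(text)
```

Note that `\b` only behaves as expected when each phrase starts and ends with a word character (a letter, digit, or underscore); phrases ending in punctuation would need a different boundary check.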

If you want NaN for rows with no match, use loc:

pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
lists = text[0].str.findall(pat)
# Empty lists are falsy, so the mask is True only for rows with a match
m = lists.astype(bool)
# Assign only the matched rows; the remaining rows are left as NaN
text.loc[m, 'new'] = lists.loc[m].str.join(', ')
print(text)
                                        0                    new
0  Little Red Corvette must Grow Your ego  Little Red, Grow Your
1                         Grow Your Beans              Grow Your
2      James Dean and his Little Red coat             Little Red
3                            I love pasta                    NaN
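As a possible variation on the NaN approach, the same steps can be wrapped in a small helper that uses `Series.where` instead of a boolean mask with loc. This is only a sketch: the function name `extract_subjects` and its `sep` parameter are hypothetical, and `re.escape` is added on the assumption that phrases should be matched literally.

```python
import re
import pandas as pd

def extract_subjects(sentences, phrases, sep=', '):
    """Return the matched phrases per sentence joined by sep, or NaN if none match.

    Hypothetical helper, not part of pandas.
    """
    # Escape each phrase (assumed literal) and wrap it in word boundaries
    pat = '|'.join(r"\b{}\b".format(re.escape(p)) for p in phrases)
    joined = sentences.str.findall(pat).str.join(sep)
    # Rows with no match produce an empty string; where() turns those into NaN
    return joined.where(joined != '')

sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego', 'Grow Your Beans',
                     'James Dean and his Little Red coat', 'I love pasta'])

text['Subjects'] = extract_subjects(text[0], sub[0])
print(text)
```

Because the helper returns a new Series rather than assigning through loc, the result does not depend on whether the target column already exists in text.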