Find partial matches between columns of two different dataframes and assign a value when a match is found

KMF*_*MFR 2 python string-matching dataframe pandas

I want to fill the "Category" column of the df1 dataframe with the correct values from the "Category" column of the df2 dataframe.

import pandas as pd

df1 = pd.DataFrame({"Receiver": ["Insurance company", "Shop", "Pizza place", "Library", "Gas station 24/7", "Something else", "Whatever receiver"], "Category": ["","","","","","",""]}) 
df2 = pd.DataFrame({"Category": ["Insurances", "Groceries", "Groceries", "Fastfood", "Fastfood", "Car"], "Searchterm": ["Insurance", "Shop", "Market", "Pizza", "Burger", "Gas"]})

Output:

df1
Receiver                Category
0   Insurance company   
1   Shop    
2   Pizza place 
3   Library 
4   Gas station 24/7    
5   Something else  
6   Whatever receiver   

df2
    Category    Searchterm
0   Insurances  Insurance
1   Groceries   Shop
2   Groceries   Market
3   Fastfood    Pizza
4   Fastfood    Burger
5   Car         Gas

I want to compare df1["Receiver"] with df2["Searchterm"] row by row and, wherever the latter matches the former even partially, assign that row's df2["Category"] to df1["Category"].

For example, "Pizza" in df2["Searchterm"] partially matches "Pizza place" in df1["Receiver"], so I want to assign "Fastfood" (the category of "Pizza" in df2["Category"]) to the category of "Pizza place" in df1["Category"].

The desired output is:

df1
Receiver                Category
0   Insurance company   Insurances
1   Shop                Groceries
2   Pizza place         Fastfood
3   Library             
4   Gas station 24/7    Car
5   Something else      
6   Whatever receiver   

So how can I fill df1["Category"] with the correct categories? Thanks.

jpp*_*jpp 5

Iterate over categories

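Assuming the number of categories is small relative to the number of receivers, one strategy is to iterate over the categories. Note that with this solution, where a receiver matches more than one search term, only the last match sticks.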

for tup in df2.itertuples(index=False):
    mask = df1['Receiver'].str.contains(tup.Searchterm, regex=False)
    df1.loc[mask, 'Category'] = tup.Category

print(df1)

#      Category           Receiver
# 0  Insurances  Insurance company
# 1   Groceries               Shop
# 2    Fastfood        Pizza place
# 3                        Library
# 4         Car   Gas station 24/7
# 5                 Something else
# 6              Whatever receiver
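The "last match sticks" caveat can be made concrete with a receiver that contains two search terms. The sketch below is not part of the original answer: the function name `assign_first_match` and the sample receiver "Pizza Shop" are made up for illustration, and it assumes an empty string marks an unfilled category. It keeps the first matching row of the search-term table instead of the last by only writing to rows that are still unfilled.

```python
import pandas as pd

def assign_first_match(receivers: pd.DataFrame, terms: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant of the loop above: the earliest matching row of `terms` wins."""
    out = receivers.copy()
    out["Category"] = ""  # assumption: "" means "no category assigned yet"
    for tup in terms.itertuples(index=False):
        # Only touch rows that no earlier search term has already filled.
        unfilled = out["Category"] == ""
        mask = unfilled & out["Receiver"].str.contains(tup.Searchterm, regex=False)
        out.loc[mask, "Category"] = tup.Category
    return out

# "Pizza Shop" contains both "Shop" (row 0) and "Pizza" (row 1).
receivers = pd.DataFrame({"Receiver": ["Pizza Shop", "Library"]})
terms = pd.DataFrame({"Category": ["Groceries", "Fastfood"],
                      "Searchterm": ["Shop", "Pizza"]})

result = assign_first_match(receivers, terms)
print(result)
```

With the original loop, "Pizza Shop" would end up as "Fastfood", because the "Pizza" row is processed after the "Shop" row and overwrites it. With this variant it stays "Groceries". Which behaviour is preferable depends on how the search terms are ordered.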

Performance benchmarking

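As noted above, this solution scales better with the number of rows in df1 than with the number of categories in df2. To illustrate, consider the performance below for input dataframes of different sizes.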

def jpp(df1, df2):
    for tup in df2.itertuples(index=False):
        df1.loc[df1['Receiver'].str.contains(tup.Searchterm, regex=False), 'Category'] = tup.Category
    return df1

def user347(df1, df2):
    df1['Category'] = df1['Receiver'].replace((df2['Searchterm'] + r'.*').values,
                                              df2['Category'].values,
                                              regex=True)
    df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
    return df1

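# Scenario 1: many receivers (df1 repeated 10**4 times), few search terms (df2 unchanged)
df1 = pd.concat([df1]*10**4, ignore_index=True)
df2 = pd.concat([df2], ignore_index=True)

# %timeit is an IPython/Jupyter magic command
%timeit jpp(df1, df2)      # 145 ms per loop
%timeit user347(df1, df2)  # 364 ms per loop

# Scenario 2: df1 unchanged from scenario 1, df2 repeated 100 times
df1 = pd.concat([df1], ignore_index=True)
df2 = pd.concat([df2]*100, ignore_index=True)

%timeit jpp(df1, df2)      # 666 ms per loop
%timeit user347(df1, df2)  # 88 ms per loop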
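Since the loop makes one pass over df1 per search term, one way to avoid that is to combine all search terms into a single regular expression. The following is a rough sketch of that idea, not one of the benchmarked solutions, and it has not been timed here. The function name `categorize_with_one_regex` is hypothetical. Its semantics also differ from the loop: for a receiver containing several search terms, the regex picks the leftmost match in the string rather than the last matching row of df2.

```python
import re
import pandas as pd

def categorize_with_one_regex(receivers: pd.DataFrame, terms: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical alternative: one regex pass over the receivers instead of one pass per term."""
    out = receivers.copy()
    # Escape each term so it is matched literally, then join them into one alternation group.
    pattern = "(" + "|".join(re.escape(t) for t in terms["Searchterm"]) + ")"
    # For each receiver, extract the leftmost search term it contains (NaN if none).
    found = out["Receiver"].str.extract(pattern, expand=False)
    # Map the extracted term back to its category; duplicate terms keep their first category.
    lookup = terms.drop_duplicates("Searchterm").set_index("Searchterm")["Category"]
    out["Category"] = found.map(lookup).fillna("")
    return out

df1 = pd.DataFrame({"Receiver": ["Insurance company", "Shop", "Pizza place", "Library",
                                 "Gas station 24/7", "Something else", "Whatever receiver"],
                    "Category": [""] * 7})
df2 = pd.DataFrame({"Category": ["Insurances", "Groceries", "Groceries", "Fastfood", "Fastfood", "Car"],
                    "Searchterm": ["Insurance", "Shop", "Market", "Pizza", "Burger", "Gas"]})

result = categorize_with_one_regex(df1, df2)
print(result)
```

On the sample data from the question this yields the same categories as the loop, because no receiver contains more than one search term.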