KMF*_*MFR 2 python string-matching dataframe pandas
I want to fill the "Category" column of the df1 dataframe with the correct values from the "Category" column of the df2 dataframe.
import pandas as pd
df1 = pd.DataFrame({"Receiver": ["Insurance company", "Shop", "Pizza place", "Library", "Gas station 24/7", "Something else", "Whatever receiver"], "Category": ["","","","","","",""]})
df2 = pd.DataFrame({"Category": ["Insurances", "Groceries", "Groceries", "Fastfood", "Fastfood", "Car"], "Searchterm": ["Insurance", "Shop", "Market", "Pizza", "Burger", "Gas"]})
Output:
df1
            Receiver Category
0  Insurance company
1               Shop
2        Pizza place
3            Library
4   Gas station 24/7
5     Something else
6  Whatever receiver
df2
     Category Searchterm
0  Insurances  Insurance
1   Groceries       Shop
2   Groceries     Market
3    Fastfood      Pizza
4    Fastfood     Burger
5         Car        Gas
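I want to compare df1["Receiver"] with df2["Searchterm"] row by row and, wherever the latter matches the former even partially, assign that row's df2["Category"] to df1["Category"].
For example, "Pizza" in df2["Searchterm"] partially matches "Pizza place" in df1["Receiver"], so I want to assign "Fastfood" (the category of "Pizza" in df2["Category"]) as the category of "Pizza place" in df1["Category"].
The desired output is: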
df1
            Receiver    Category
0  Insurance company  Insurances
1               Shop   Groceries
2        Pizza place    Fastfood
3            Library
4   Gas station 24/7         Car
5     Something else
6  Whatever receiver
So how can I fill df1["Category"] with the correct categories? Thanks.
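Assuming the number of categories is small relative to the number of receivers, one strategy is to iterate over the categories. Note that with this solution, where a receiver matches more than one search term, only the last match sticks.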
for tup in df2.itertuples(index=False):
    mask = df1['Receiver'].str.contains(tup.Searchterm, regex=False)
    df1.loc[mask, 'Category'] = tup.Category
print(df1)
#      Category           Receiver
# 0  Insurances  Insurance company
# 1   Groceries               Shop
# 2    Fastfood        Pizza place
# 3                        Library
# 4         Car   Gas station 24/7
# 5                 Something else
# 6              Whatever receiver
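Because the loop overwrites earlier assignments, a receiver that matches several search terms ends up with the category of the last matching row of df2. If the first match should win instead, one possible tweak is to fill only rows whose category is still empty. This is a minimal sketch rather than part of the original answer; the sample data ("Pizza shop" and the lowercase "shop" term) is made up to force a double match, and it assumes unfilled categories are empty strings as in the question.

```python
import pandas as pd

# Hypothetical sample data: "Pizza shop" matches both "shop" and "Pizza".
df1 = pd.DataFrame({
    "Receiver": ["Pizza shop", "Gas station 24/7", "Library"],
    "Category": ["", "", ""],
})
df2 = pd.DataFrame({
    "Category": ["Groceries", "Fastfood", "Car"],
    "Searchterm": ["shop", "Pizza", "Gas"],
})

for tup in df2.itertuples(index=False):
    # Plain (non-regex), case-sensitive substring test against every receiver.
    mask = df1["Receiver"].str.contains(tup.Searchterm, regex=False)
    # Assign only where no category has been set yet, so earlier rows of
    # df2 take priority over later ones.
    df1.loc[mask & df1["Category"].eq(""), "Category"] = tup.Category

print(df1)
```

With this change "Pizza shop" keeps "Groceries" (its first match), whereas the unmodified loop would leave it as "Fastfood".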
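As noted above, this solution scales better with the number of rows in df1 than with the number of categories in df2. To illustrate, consider the timings below for input dataframes of different sizes.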
def jpp(df1, df2):
    for tup in df2.itertuples(index=False):
        df1.loc[df1['Receiver'].str.contains(tup.Searchterm, regex=False), 'Category'] = tup.Category
    return df1

def user347(df1, df2):
    df1['Category'] = df1['Receiver'].replace((df2['Searchterm'] + r'.*').values,
                                              df2['Category'].values,
                                              regex=True)
    df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
    return df1

df1 = pd.concat([df1]*10**4, ignore_index=True)
df2 = pd.concat([df2], ignore_index=True)
%timeit jpp(df1, df2)      # 145 ms per loop
%timeit user347(df1, df2)  # 364 ms per loop

df1 = pd.concat([df1], ignore_index=True)
df2 = pd.concat([df2]*100, ignore_index=True)
%timeit jpp(df1, df2)      # 666 ms per loop
%timeit user347(df1, df2)  # 88 ms per loop
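The %timeit lines only work inside IPython or Jupyter. To repeat the comparison from a plain Python script, something like the following sketch with the standard timeit module could be used. The helper time_both, the repeat counts, and the use of .tolist() instead of .values are my own choices, not part of the original answer; each call runs on a fresh copy because both functions modify df1 in place, so the reported time includes the cost of the copy. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd


def jpp(df1, df2):
    # Loop over search terms; one vectorised substring test per term.
    for tup in df2.itertuples(index=False):
        mask = df1["Receiver"].str.contains(tup.Searchterm, regex=False)
        df1.loc[mask, "Category"] = tup.Category
    return df1


def user347(df1, df2):
    # Replace any receiver matching "<term>.*" with that term's category.
    df1["Category"] = df1["Receiver"].replace(
        (df2["Searchterm"] + r".*").tolist(), df2["Category"].tolist(), regex=True
    )
    # Rows whose Receiver text still appears among the Category values were
    # not replaced, so blank them out.
    df1.loc[df1["Receiver"].isin(df1["Category"]), "Category"] = ""
    return df1


base_df1 = pd.DataFrame({
    "Receiver": ["Insurance company", "Shop", "Pizza place", "Library",
                 "Gas station 24/7", "Something else", "Whatever receiver"],
    "Category": [""] * 7,
})
base_df2 = pd.DataFrame({
    "Category": ["Insurances", "Groceries", "Groceries", "Fastfood", "Fastfood", "Car"],
    "Searchterm": ["Insurance", "Shop", "Market", "Pizza", "Burger", "Gas"],
})


def time_both(df1, df2, number=3):
    """Return average seconds per call for each function (hypothetical helper)."""
    return {
        func.__name__: timeit.timeit(lambda: func(df1.copy(), df2), number=number) / number
        for func in (jpp, user347)
    }


# Many receivers, few search terms.
many_rows = time_both(pd.concat([base_df1] * 1000, ignore_index=True), base_df2)
# Few receivers, many search terms.
many_terms = time_both(base_df1, pd.concat([base_df2] * 100, ignore_index=True))

print("many rows: ", many_rows)
print("many terms:", many_terms)
```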