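The city names in the two DataFrames are not formatted the same way. I would like to do a left outer join and pull the Geo field for every partial string match between the City fields of the two DataFrames.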
import pandas as pd
df1 = pd.DataFrame({
    'City': ['San Francisco, CA', 'Oakland, CA'],
    'Val': [1, 2]
})
df2 = pd.DataFrame({
    'City': ['San Francisco-Oakland, CA', 'Salinas, CA'],
    'Geo': ['geo1', 'geo2']
})
Expected DataFrame after the join:
City Val Geo
San Francisco, CA 1 geo1
Oakland, CA 2 geo1
Cor*_*ien 16
Update: the fuzzywuzzy project has been renamed to thefuzz and moved to a new repository.
You can use the thefuzz package and its extractOne function:
# Python env: pip install thefuzz
# Anaconda env: pip install thefuzz
# -> thefuzz is not yet available on Anaconda (2021-09-18)
# -> you can use the old package: conda install -c conda-forge fuzzywuzzy
from thefuzz import process
best_city = lambda x: process.extractOne(x, df2["City"])[2] # See note below
df1['Geo'] = df2.loc[df1["City"].map(best_city).values, 'Geo'].values
Output:
>>> df1
City Val Geo
0 San Francisco, CA 1 geo1
1 Oakland, CA 2 geo1
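Note: extractOne returns a 3-tuple for the best match: the matching city name from df2 ([0]), the similarity score ([1]), and the index label ([2], the one used here). The index label is included because the choices are passed as a pandas Series.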
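With its default settings, extractOne returns its best candidate even when that candidate is a poor match, so every row of df1 receives a Geo value. If unmatched rows should stay empty, as in a true left join, a score cutoff is needed. The sketch below illustrates the idea without thefuzz: it swaps in the standard library's difflib and a sliding-window "partial ratio" (a plain whole-string ratio can rank 'Salinas, CA' above the metro name for 'Oakland, CA'). The helper names partial_ratio and best_match, the 0.7 cutoff, and the 'Denver, CO' row are assumptions made for illustration, not part of the answer above or of thefuzz.

```python
from difflib import SequenceMatcher


def partial_ratio(query, choice):
    """Best difflib ratio of the shorter string against every
    equally long window of the longer string (0.0 to 1.0)."""
    short, long_ = sorted((query, choice), key=len)
    if not short:
        return 0.0
    width = len(short)
    return max(
        SequenceMatcher(None, short, long_[start:start + width]).ratio()
        for start in range(len(long_) - width + 1)
    )


def best_match(query, choices, cutoff=0.7):
    """Return the position of the best-scoring choice,
    or None when nothing reaches the cutoff (hypothetical helper)."""
    if not choices:
        return None
    score, position = max(
        (partial_ratio(query, choice), i) for i, choice in enumerate(choices)
    )
    return position if score >= cutoff else None


cities = ["San Francisco, CA", "Oakland, CA", "Denver, CO"]
metros = ["San Francisco-Oakland, CA", "Salinas, CA"]
geos = ["geo1", "geo2"]

# Look up Geo by matched position; unmatched cities keep None.
positions = [best_match(city, metros) for city in cities]
result = [geos[p] if p is not None else None for p in positions]
print(result)
```

In a DataFrame setting, the same lookup could be applied with df1["City"].map(...); the cutoff would need tuning on real data, since difflib scores are not on the same scale as thefuzz scores.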
This should do the job. It uses string matching based on the Levenshtein distance.
pip install thefuzz[speedup]
import pandas as pd
import numpy as np
from thefuzz import process
def fuzzy_match(
    a: pd.DataFrame, b: pd.DataFrame, col: str, limit: int = 5, thresh: int = 80
):
    """use fuzzy matching to join on column"""
    # NOTE: `thresh` is accepted but never applied in this function
    s = b[col].tolist()
    matches = a[col].apply(lambda x: process.extract(x, s, limit=limit))
    matches = pd.DataFrame(np.concatenate(matches), columns=["match", "score"])
    # join other columns in b to matches
    to_join = (
        pd.merge(left=b, right=matches, how="right", left_on=col, right_on="match")
        .set_index(  # create an index that represents the matching row in df a, you can drop this when `limit=1`
            np.repeat(np.arange(len(a)), min(limit, len(b)))
        )
        .drop(columns=["match"])
        .astype({"score": "int16"})
    )
    print("\t the index here represents the row in dataframe a on which to join")
    print(to_join)
    res = pd.merge(
        left=a, right=to_join, left_index=True, right_index=True, suffixes=("", "_b")
    )
    # return only the highest match or you can just set the limit to 1
    # and remove this
    df = res.reset_index()
    df = df.iloc[df.groupby(by="index")["score"].idxmax()].reset_index(drop=True)
    return df.drop(columns=[f"{col}_b", "score", "index"])


def test(df):
    expected = pd.DataFrame(
        {
            "City": ["San Francisco, CA", "Oakland, CA"],
            "Val": [1, 2],
            "Geo": ["geo1", "geo1"],
        }
    )
    print(f'{"expected":-^70}')
    print(expected)
    print(f'{"res":-^70}')
    print(df)
    assert expected.equals(df)


if __name__ == "__main__":
    a = pd.DataFrame({"City": ["San Francisco, CA", "Oakland, CA"], "Val": [1, 2]})
    b = pd.DataFrame(
        {"City": ["San Francisco-Oakland, CA", "Salinas, CA"], "Geo": ["geo1", "geo2"]}
    )
    print(f'\n\n{"fuzzy match":-^70}')
    res = fuzzy_match(a, b, col="City")
    test(res)
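For reference, the Levenshtein distance mentioned in this answer is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch of the textbook dynamic-programming formulation, shown only to make the metric concrete. The function name levenshtein is chosen here for illustration; it is not thefuzz's implementation, and thefuzz scores are derived ratios on a 0 to 100 scale rather than raw distances.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed row by row."""
    # prev[j] is the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete char_a
                    curr[j - 1] + 1,     # insert char_b
                    prev[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A raw distance grows with string length, which is one reason fuzzy-matching libraries usually report a normalized similarity score instead.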