Pandas: find the rows closest to a profile

Question

I have a file full of profiles, like this:

 profile_id  colA  colB  colC  colD
 1           1     20    50    63
 2           1     20    65    38
 3           8     5     3     4
 4           98    1     878   4
 ...


I have another CSV file containing the results for which I want to find a profile:

col    value    score
colA   1        85
colA   1        856
colA   8        200000
colB   1        2356
colC   878      99999
colD   4        2
...


For each colX, I want to extract the value with the highest score and find which profile_id in the first file it is associated with.

What I have done works:

import pandas as pd

profiles = pd.read_csv("profiles.csv", sep="\t", index_col=False)
df = pd.read_csv("results.csv", sep="\t", index_col=False)

found_col = set(df["col"])
good_profile = profiles.copy()
for col in profiles.columns:
    if col == "profile_id":
        continue
    elif col not in found_col:
        print(f"{col} not found")
    else:
        # value with the highest score for this column
        value = int(df.loc[df.loc[df["col"] == col, "score"].idxmax(), "value"])
        # keep only the profiles that have this value in this column
        good_profile = good_profile[good_profile[col] == value]
print(good_profile)


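This gives me the result I want, but I first extract a subset on the first column, then a subset of that subset on the second column, and so on.

The nice thing is that I still get a result when some columns are missing, which is great.

I would like to know whether there is a better way to do this, without creating a subset of the previous subset each time.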

Answer 1

Qua*_*ang 0

Here is my attempt:

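# assumed naming: df1 is the profiles frame, df2 is the results frame
# for each col, keep the value of the row with the highest score
new_df = df2.loc[df2.groupby('col').score.idxmax(), ['col','value']]

# reshape the profiles to long form (profile_id, col, value) and merge,
# so each (col, value) pair is matched to the profile_ids that contain it
new_df.merge(df1.melt(id_vars='profile_id', var_name='col'),
             on=['col','value'],
             how='left')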


Output:

    col  value  profile_id
0  colA      8           3
1  colB      1           4
2  colC    878           4
3  colD      4           3
4  colD      4           4

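The merged table lists matches per column, while the question asks for one profile. A possible next step, sketched here under the assumption that "closest" means "matches the best-scoring value in the most columns", is to count the matched columns per profile_id. The inline data is just the sample rows shown in the question, and the names `best`, `long_profiles`, `matches`, `hits` and `closest` are illustrative, not part of the original answer.

```python
import io
import pandas as pd

# Sample rows copied from the question (whitespace-separated here for brevity).
profiles = pd.read_csv(io.StringIO("""\
profile_id colA colB colC colD
1 1 20 50 63
2 1 20 65 38
3 8 5 3 4
4 98 1 878 4
"""), sep=r"\s+")

results = pd.read_csv(io.StringIO("""\
col value score
colA 1 85
colA 1 856
colA 8 200000
colB 1 2356
colC 878 99999
colD 4 2
"""), sep=r"\s+")

# For each col, keep the value of the row with the highest score.
best = results.loc[results.groupby("col")["score"].idxmax(), ["col", "value"]]

# Long form of the profiles: one row per (profile_id, col, value).
long_profiles = profiles.melt(id_vars="profile_id", var_name="col")

# An inner merge keeps only the (col, value) pairs that some profile contains.
matches = best.merge(long_profiles, on=["col", "value"], how="inner")

# Number of columns in which each profile matches the best-scoring value.
hits = matches.groupby("profile_id").size()

# Assumption: the "closest" profile is the one with the most matching columns.
closest = hits.idxmax()
print(hits)
print("closest profile_id:", closest)
```

With this sample no profile matches all four columns, so the chained filtering from the question would return an empty frame; counting matches still ranks profile 4 first. Columns absent from the results file simply contribute no matches, so the behaviour with missing columns is preserved. If only exact matches are wanted, comparing `hits` with the number of rows in `best` would select them.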
