Pandas - group by and rank within each group based on multiple columns

rig*_*ere 5 python dataframe pandas

I have a dataframe like this:

df = pd.DataFrame({'asset_id': [10,10, 10, 20, 20, 20], 'method_id': ['p2','p3','p4', 'p3', 'p1', 'p2'], 'method_rank': [5, 2, 2, 2, 5, 1], 'conf_score': [0.8, 0.6, 0.8, 0.9, 0.7, 0.5]} , columns= ['asset_id', 'method_id','method_rank', 'conf_score']) 

It looks like this:

   asset_id method_id  method_rank  conf_score
0    10        p2          5         0.8
1    10        p3          2         0.6
2    10        p4          2         0.8
3    20        p3          2         0.9
4    20        p1          5         0.7
5    20        p2          1         0.5

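I want to group the rows by asset_id and then give each row an overall rank within its group, ordered by method_rank ascending and conf_score descending.

I.e., I want the result to look like this: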

  asset_id method_id  method_rank  conf_score  overall_rank
5    20        p2         1           0.5          1.0
3    20        p3         2           0.9          2.0
2    10        p4         2           0.8          1.0
1    10        p3         2           0.6          2.0
0    10        p2         5           0.8          3.0
4    20        p1         5           0.7          3.0

How can I do this with groupby and rank in pandas? It looks like pandas only lets you rank on a single column, for example:

df["overall_rank"] = df.groupby('asset_id')['method_rank'].rank("first")

But I want to achieve something like this:

df["overall_rank"] = df.groupby('asset_id')[['method_rank', 'conf_score']].rank("first", ascending = [True, False])

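How should I do this? I know one hacky way is to first call sort_values on the whole dataframe and then do the groupby, but sorting every row of the dataframe seems too expensive when I only want to order a few rows within each group.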

l m*_*zhi 10

Method 1:

df.sort_values(['asset_id', 'method_rank', 'conf_score'], ascending=[True, True, False], inplace=True)
df['overall_rank'] = 1
df['overall_rank'] = df.groupby(['asset_id'])['overall_rank'].cumsum()

df

   asset_id method_id  method_rank  conf_score  overall_rank
2        10        p4            2         0.8             1
1        10        p3            2         0.6             2
0        10        p2            5         0.8             3
5        20        p2            1         0.5             1
3        20        p3            2         0.9             2
4        20        p1            5         0.7             3
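Method 1 reorders df in place and needs a temporary column of ones. As a hedged variation (not part of the original answer), the sketch below sorts a copy and uses `cumcount()` to number the rows inside each group. It assumes that keeping df in its original row order is acceptable or desired; the name `ordered` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'asset_id': [10, 10, 10, 20, 20, 20],
    'method_id': ['p2', 'p3', 'p4', 'p3', 'p1', 'p2'],
    'method_rank': [5, 2, 2, 2, 5, 1],
    'conf_score': [0.8, 0.6, 0.8, 0.9, 0.7, 0.5],
})

# Sort a copy; df itself keeps its original row order.
ordered = df.sort_values(['asset_id', 'method_rank', 'conf_score'],
                         ascending=[True, True, False])

# cumcount() numbers rows 0..n-1 within each group in their current order.
# The assignment aligns on the index, so each rank lands on the matching row of df.
df['overall_rank'] = ordered.groupby('asset_id').cumcount() + 1

print(df)
```

Because the assignment aligns on the index, no second sort is needed to restore the original order. Rows that tie on both method_rank and conf_score still receive distinct consecutive ranks, as in Method 1.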

Method 2:

Define a function that sorts each group, then apply it per asset_id:

import numpy as np

def handle_group(group):
    group.sort_values(['method_rank', 'conf_score'], ascending=[True, False], inplace=True)
    group['overall_rank'] = np.arange(1, len(group)+1)
    return group

df.groupby('asset_id', as_index=False).apply(handle_group)

Performance test:

def run1(df):
    df = df.sort_values(['asset_id', 'method_rank', 'conf_score'], ascending=[True, True, False])
    df['overall_rank'] = 1
    df['overall_rank'] = df.groupby(['asset_id'])['overall_rank'].cumsum()
    return df

def handle_group(group):
    group.sort_values(['method_rank', 'conf_score'], ascending=[True, False], inplace=True)
    group['overall_rank'] = np.arange(1, len(group)+1)
    return group

def run2(df):
    df = df.groupby('asset_id', as_index=False).apply(handle_group)
    return df

dfn = pd.concat([df]*10000, ignore_index=True)

%%timeit
df1 = run1(dfn)
# 8.61 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
df2 = run2(dfn).droplevel(0)
# 31.6 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
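The question asked for something closer to `groupby(...).rank(...)` on two columns. That call ranks one column at a time, so one possible workaround (a sketch, not from the original answer and not benchmarked here) is to fold both columns into a single numeric key and rank that key per group. It assumes method_rank holds integers; with fractional values the composite key would no longer preserve the intended order. The names `conf_pos` and `key` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'asset_id': [10, 10, 10, 20, 20, 20],
    'method_id': ['p2', 'p3', 'p4', 'p3', 'p1', 'p2'],
    'method_rank': [5, 2, 2, 2, 5, 1],
    'conf_score': [0.8, 0.6, 0.8, 0.9, 0.7, 0.5],
})

# Position of each conf_score within its asset, highest score first (1, 2, ...).
conf_pos = df.groupby('asset_id')['conf_score'].rank(method='dense', ascending=False)

# Composite key: method_rank dominates, conf_pos only breaks ties.
# Assumes integer method_rank, so the tie-breaker never crosses into the next rank.
key = df['method_rank'] * (conf_pos.max() + 1) + conf_pos

# Rank the single key within each asset; 'first' gives distinct ranks on exact ties.
df['overall_rank'] = key.groupby(df['asset_id']).rank(method='first').astype(int)

print(df)
```

This leaves df in its original row order and avoids sorting the full dataframe explicitly, at the cost of two grouped rank passes. Whether it is faster than Method 1 would need to be measured on the actual data.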