Create relative values for each group in a pd DataFrame

Nra*_*ras 2 python pandas pandas-groupby

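Consider this DataFrame with many columns. It has a feature defined in the column 'feature' and some values in the column 'values'.

I want an additional column holding the relative value within each feature (group). The desired result is what I pre-computed by hand in the column 'desired'.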

import pandas as pd

df = pd.DataFrame(
    data={
        'feature': [1, 1, 2, 3, 3, 3],
        'values': [30.0, 20.0, 25.0, 100.0, 250.0, 50.0],
        'desired': [0.6, 0.4, 1.0, 0.25, 0.625, 0.125],
        'more_columns': range(6),
    },
)

This results in the DataFrame

   feature  values  desired  more_columns
0        1    30.0    0.600             0
1        1    20.0    0.400             1
2        2    25.0    1.000             2
3        3   100.0    0.250             3
4        3   250.0    0.625             4
5        3    50.0    0.125             5

So for the group defined by feature 1, the desired values are 0.6 and 0.4 (because 0.6 = 30 / (20+30)), and so on.

I derived these values manually using

for feature, group in df.groupby('feature'):
    rel_values = (group['values'] / group['values'].sum()).values
    df[df['feature'] == feature]['result'] = rel_values  # no effect
    print(f'{feature}: {rel_values}')

# which prints:
1: [0.6 0.4]
2: [1.]
3: [0.25  0.625 0.125]

# but df remains unchanged

I am sure pandas must have a smart and fast way to achieve this.

jez*_*ael 5

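Use GroupBy.transform to return a Series of summed values with the same size as the original df, so each value can then be divided by its group sum with div: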

df['new'] = df['values'].div(df.groupby('feature')['values'].transform('sum'))
print(df)
   feature  values  desired  more_columns    new
0        1    30.0    0.600             0  0.600
1        1    20.0    0.400             1  0.400
2        2    25.0    1.000             2  1.000
3        3   100.0    0.250             3  0.250
4        3   250.0    0.625             4  0.625
5        3    50.0    0.125             5  0.125
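
To double-check the result against the hand-computed column, something like the following sketch could be used. It rebuilds the question's frame (without 'more_columns'); the names group_sums, matches_desired and share_totals are only illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# Same data as in the question, minus the unrelated 'more_columns'.
df = pd.DataFrame({
    'feature': [1, 1, 2, 3, 3, 3],
    'values': [30.0, 20.0, 25.0, 100.0, 250.0, 50.0],
    'desired': [0.6, 0.4, 1.0, 0.25, 0.625, 0.125],
})

# Per-row group sum, indexed like df.
group_sums = df.groupby('feature')['values'].transform('sum')
df['new'] = df['values'].div(group_sums)

# Compare with a tolerance rather than ==, since these are floats.
matches_desired = np.allclose(df['new'], df['desired'])

# Sanity check: the relative values inside each group should add up to 1.
share_totals = df.groupby('feature')['new'].sum()
```

Here matches_desired should come out True, and every entry of share_totals should be 1 up to floating-point rounding.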

Details:

print(df.groupby('feature')['values'].transform('sum'))
0     50.0
1     50.0
2     25.0
3    400.0
4    400.0
5    400.0
Name: values, dtype: float64
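
One way to picture what transform('sum') does is as a two-step "aggregate, then map back to the rows" operation. The sketch below is only an illustration of that idea on the question's data, not a description of how pandas implements it; per_group and broadcast are names made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'feature': [1, 1, 2, 3, 3, 3],
    'values': [30.0, 20.0, 25.0, 100.0, 250.0, 50.0],
})

# Step 1: plain aggregation gives one sum per group (indexed by feature).
per_group = df.groupby('feature')['values'].sum()

# Step 2: look up each row's feature in that result,
# which yields one value per original row.
broadcast = df['feature'].map(per_group)

# transform('sum') produces the same per-row values in a single call.
via_transform = df.groupby('feature')['values'].transform('sum')
same_values = np.allclose(broadcast, via_transform)
```

Both routes give one group sum per original row; transform simply saves the explicit lookup.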

Performance:

On real data it depends on the number of groups and the length of the DataFrame:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N)
df = pd.DataFrame({'feature': np.random.choice(L, N),
                   'values':np.random.rand(N)})
#print (df)

In [272]: %timeit df['new'] = df['values'].div(df.groupby('feature')['values'].transform('sum'))
80.7 ms ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [273]: %timeit df['desired'] = df.groupby('feature').apply(lambda g: g['values'] / g['values'].sum()).values
1.17 s ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [274]: %timeit df['desired'] = df.groupby('feature')['values'].transform(lambda x: x / x.sum())
727 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
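
Apart from speed, the transform-based versions return a Series indexed like df, so the result should stay aligned with the original rows even when they are not sorted by feature. Taking .values from an apply result discards the index, so I would treat that variant with care on unsorted data. A minimal sketch with made-up, shuffled rows (the same numbers as in the question, in a different order; expected is computed by hand):

```python
import numpy as np
import pandas as pd

# Rows deliberately not sorted by 'feature'.
df = pd.DataFrame({
    'feature': [3, 1, 3, 2, 1, 3],
    'values': [100.0, 30.0, 250.0, 25.0, 20.0, 50.0],
})

# transform returns a Series aligned with df's index,
# so the division lines up row by row.
df['new'] = df['values'] / df.groupby('feature')['values'].transform('sum')

# Hand-computed shares: feature 1 sums to 50, feature 2 to 25, feature 3 to 400.
expected = [0.25, 0.6, 0.625, 1.0, 0.4, 0.125]
aligned = np.allclose(df['new'], expected)
```

If aligned is True, each row received the share of its own group rather than a value from a different row.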