Writing your own custom aggregation function for groupby

Des*_*wal 5 python dataframe pandas pandas-groupby

I have a dataset that can be found here.

It gives us a DataFrame like this:

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|')
df.head()

    user_id age  gender occupation        zip_code
    1       24   M        technician        85711
    2       53   F        other             94043
    3       23   M        writer            32067
    4       24   M        technician        43537
    5       33   F        other             15213


I want to find the male-to-female ratio for each occupation.

I have used the expression below, but it is not the best approach.

df.groupby(['occupation', 'gender']).agg({'gender':'count'}).div(df.groupby('occupation').agg('count'), level='occupation')['gender']*100


This gives a result like:

occupation     gender
administrator  F          45.569620
               M          54.430380
artist         F          46.428571
               M          53.571429

The format of this result is very different from what I want, which is something like the following (for illustration):

occupation      M:F

programmer      2:3
farmer          7:2

Can someone tell me how to write my own aggregation function?

Pav*_*apu 0

Does this work for you?

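from fractions import Fraction

# Share of each gender within its occupation (count per pair / count per occupation)
df_g = df.groupby(['occupation', 'gender']).count().user_id/df.groupby(['occupation']).count().user_id
df_g = df_g.reset_index()
df_g['ratio'] = df_g['user_id'].apply(lambda x: str(Fraction(x).limit_denominator()).replace('/',':'))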

Output:

       occupation gender   user_id  ratio
0   administrator      F  0.455696  36:79
1   administrator      M  0.544304  43:79
2          artist      F  0.464286  13:28
3          artist      M  0.535714  15:28
4          doctor      M  1.000000      1
5        educator      F  0.273684  26:95
6        educator      M  0.726316  69:95
7        engineer      F  0.029851   2:67
8        engineer      M  0.970149  65:67
9   entertainment      F  0.111111    1:9
10  entertainment      M  0.888889    8:9
11      executive      F  0.093750   3:32
12      executive      M  0.906250  29:32
13     healthcare      F  0.687500  11:16
14     healthcare      M  0.312500   5:16
15      homemaker      F  0.857143    6:7
16      homemaker      M  0.142857    1:7
17         lawyer      F  0.166667    1:6
18         lawyer      M  0.833333    5:6
19      librarian      F  0.568627  29:51
20      librarian      M  0.431373  22:51
21      marketing      F  0.384615   5:13
22      marketing      M  0.615385   8:13
23           none      F  0.444444    4:9
24           none      M  0.555556    5:9
25          other      F  0.342857  12:35
26          other      M  0.657143  23:35
27     programmer      F  0.090909   1:11
28     programmer      M  0.909091  10:11
29        retired      F  0.071429   1:14
30        retired      M  0.928571  13:14
31       salesman      F  0.250000    1:4
32       salesman      M  0.750000    3:4
33      scientist      F  0.096774   3:31
34      scientist      M  0.903226  28:31
35        student      F  0.306122  15:49
36        student      M  0.693878  34:49
37     technician      F  0.037037   1:27
38     technician      M  0.962963  26:27
39         writer      F  0.422222  19:45
40         writer      M  0.577778  26:45
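Note that the ratio column above expresses each gender's share of the occupation total (for example 36:79), not male to female directly. If a single M:F value per occupation is wanted, as in the question, one option is to pass a custom function to `agg`. The following is only a minimal sketch: the function name `male_female_ratio` and the small `sample` frame are hypothetical, made up for illustration, and the sketch assumes the gender column contains only the labels `M` and `F`.

```python
from math import gcd

import pandas as pd


def male_female_ratio(genders: pd.Series) -> str:
    """Return the reduced 'M:F' count ratio for one group of gender labels."""
    males = int((genders == "M").sum())
    females = int((genders == "F").sum())
    # gcd(0, 0) is 0, so fall back to 1 to avoid dividing by zero.
    divisor = gcd(males, females) or 1
    return f"{males // divisor}:{females // divisor}"


# Hypothetical sample with the same column names as the question's data.
sample = pd.DataFrame(
    {
        "occupation": ["programmer"] * 5 + ["farmer"] * 9 + ["doctor"] * 2,
        "gender": list("MMFFF") + list("MMMMMMMFF") + list("MM"),
    }
)

# agg calls the function once per occupation with that group's gender values.
ratios = sample.groupby("occupation")["gender"].agg(male_female_ratio).rename("M:F")
print(ratios)
```

On the real data the same call would be `df.groupby('occupation')['gender'].agg(male_female_ratio)`. An occupation with no women, such as doctor in the output above, comes out as `1:0` under this sketch; whether that is the desired representation is a design choice.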