Normalize a DataFrame column with min-max scaling, grouped by another column

Vam*_*ala 2 python dataframe pandas pandas-groupby

The DataFrame looks like this:

Name     Job      Salary
john   painter    40000
peter  engineer   50000
sam     plumber   30000
john    doctor    500000
john    driver    20000
sam    carpenter  10000
peter  scientist  100000

How can I group by the "Name" column and apply min-max normalization to the "Salary" column within each group?

Expected result:

Name     Job      Salary
john   painter    0.041666
peter  engineer   0
sam     plumber   1
john    doctor    1
john    driver    0
sam    carpenter  0
peter  scientist  1

I tried the following:

data = df.groupby('Name').transform(lambda x: (x - x.min()) / x.max()- x.min())

However, this produces:

         Salary
0 -19999.960000
1 -50000.000000
2  -9999.333333
3 -19999.040000
4 -20000.000000
5 -10000.000000
6 -49999.500000

tim*_*geb 5

You're almost there.

>>> df
    Name        Job  Salary
0   john    painter   40000
1  peter   engineer   50000
2    sam    plumber   30000
3   john     doctor  500000
4   john     driver   20000
5    sam  carpenter   10000
6  peter  scientist  100000
>>>
>>> result = df.assign(Salary=df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min())))
>>> # alternatively, df['Salary'] = df.groupby(... if you don't need a new frame
>>> result
    Name        Job    Salary
0   john    painter  0.041667
1  peter   engineer  0.000000
2    sam    plumber  1.000000
3   john     doctor  1.000000
4   john     driver  0.000000
5    sam  carpenter  0.000000
6  peter  scientist  1.000000

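So basically, you just forgot to wrap x.max() - x.min() in parentheses. Without them, the division is applied before the final subtraction, so each group's minimum is subtracted from an already-divided value.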
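
The precedence issue can be reproduced side by side. The following is a minimal sketch rather than the answer's exact code: it selects the Salary column before calling transform, which is an assumption made here so that the non-numeric Job column never reaches the lambda (recent pandas releases may raise on it instead of silently dropping it). The names `salary`, `wrong` and `right` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["john", "peter", "sam", "john", "john", "sam", "peter"],
    "Job": ["painter", "engineer", "plumber", "doctor",
            "driver", "carpenter", "scientist"],
    "Salary": [40000, 50000, 30000, 500000, 20000, 10000, 100000],
})

# Assumption: restrict the groupby to the numeric column up front.
salary = df.groupby("Name")["Salary"]

# Without parentheses, division binds tighter than subtraction, so this
# is evaluated as ((x - min) / max) - min.
wrong = salary.transform(lambda x: (x - x.min()) / x.max() - x.min())

# With parentheses, the denominator is the group's range, as intended.
right = salary.transform(lambda x: (x - x.min()) / (x.max() - x.min()))

result = df.assign(Salary=right)
print(result)
```

For the first row (john, painter), `wrong` works out to 20000 / 500000 - 20000, which matches the large negative value in the question, while `right` gives 20000 / 480000.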


Note that this can be done faster with a series of vectorized operations.

>>> grouper = df.groupby('Name')['Salary']
>>> maxes = grouper.transform('max')
>>> mins = grouper.transform('min')
>>>
>>> result = df.assign(Salary=(df.Salary - mins)/(maxes - mins))
>>> result
    Name        Job    Salary
0   john    painter  0.041667
1  peter   engineer  0.000000
2    sam    plumber  1.000000
3   john     doctor  1.000000
4   john     driver  0.000000
5    sam  carpenter  0.000000
6  peter  scientist  1.000000
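
If the vectorized version is needed in more than one place, it could be wrapped in a small helper. The sketch below is one possible shape, not part of the original answer: the function name `minmax_by_group` and the choice to map constant groups (where max equals min) to 0.0 instead of NaN are both assumptions made for illustration.

```python
import pandas as pd


def minmax_by_group(df, group_col, value_col):
    """Scale value_col to [0, 1] within each group of group_col."""
    grouped = df.groupby(group_col)[value_col]
    mins = grouped.transform("min")
    spans = grouped.transform("max") - mins
    # Assumption: a group whose values are all equal has a span of 0;
    # replace that span with 1 so the result is 0.0 rather than NaN.
    safe_spans = spans.where(spans != 0, 1)
    return (df[value_col] - mins) / safe_spans


demo = pd.DataFrame({
    "Name": ["john", "peter", "sam", "john", "john", "sam", "peter", "amy"],
    "Salary": [40000, 50000, 30000, 500000, 20000, 10000, 100000, 7000],
})
scaled = minmax_by_group(demo, "Name", "Salary")
print(demo.assign(Salary=scaled))
```

Here "amy" is a made-up single-row group added to exercise the zero-span case; without the guard, her row would come out as NaN because of the 0 / 0 division.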

Timings:

>>> # Setup
>>> df = pd.concat([df]*1000, ignore_index=True)
>>> df.Name = np.arange(len(df)//4).repeat(4) # 4 rows per group
>>> df
      Name        Job  Salary
0        0    painter   40000
1        0   engineer   50000
2        0    plumber   30000
3        0     doctor  500000
4        1     driver   20000
...    ...        ...     ...
6995  1748    plumber   30000
6996  1749     doctor  500000
6997  1749     driver   20000
6998  1749  carpenter   10000
6999  1749  scientist  100000

[7000 rows x 3 columns]
>>>
>>> # Tests @ i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min()))
1.19 s ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
...: grouper = df.groupby('Name')['Salary']
...: maxes = grouper.transform('max')
...: mins = grouper.transform('min')
...: (df.Salary - mins)/(maxes - mins)
...:
...:
3.04 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
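
The `%timeit` magics above only work in IPython. A rough equivalent for a plain Python script might look like the sketch below, which also checks that both approaches return the same values. The function names `with_lambda` and `vectorized` are hypothetical, the lambda version selects the Salary column first (an assumption, so it runs regardless of how the pandas version treats the string column), and the measured times will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "Name": ["john", "peter", "sam", "john", "john", "sam", "peter"],
    "Job": ["painter", "engineer", "plumber", "doctor",
            "driver", "carpenter", "scientist"],
    "Salary": [40000, 50000, 30000, 500000, 20000, 10000, 100000],
})

# Same setup as the timings above: 7000 rows, 4 rows per group.
big = pd.concat([base] * 1000, ignore_index=True)
big["Name"] = np.arange(len(big) // 4).repeat(4)


def with_lambda(frame):
    # Calls the Python lambda once per group.
    return frame.groupby("Name")["Salary"].transform(
        lambda x: (x - x.min()) / (x.max() - x.min())
    )


def vectorized(frame):
    # Two built-in aggregations, then whole-column arithmetic.
    grouper = frame.groupby("Name")["Salary"]
    mins = grouper.transform("min")
    maxes = grouper.transform("max")
    return (frame["Salary"] - mins) / (maxes - mins)


same = np.allclose(with_lambda(big), vectorized(big))

t_lambda = timeit.timeit(lambda: with_lambda(big), number=1)
t_vec = timeit.timeit(lambda: vectorized(big), number=10) / 10
print(f"results match: {same}")
print(f"lambda: {t_lambda:.4f} s per call, vectorized: {t_vec:.4f} s per call")
```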