Vam*_*ala 2 python dataframe pandas pandas-groupby
The dataframe looks like this:
Name Job Salary
john painter 40000
peter engineer 50000
sam plumber 30000
john doctor 500000
john driver 20000
sam carpenter 10000
peter scientist 100000
How do I group by the 'Name' column and apply min-max normalization to the 'Salary' column within each group?
Expected result:
Name Job Salary
john painter 0.041666
peter engineer 0
sam plumber 1
john doctor 1
john driver 0
sam carpenter 0
peter scientist 1
I tried the following:
data = df.groupby('Name').transform(lambda x: (x - x.min()) / x.max()- x.min())
However, this produces:
Salary
0 -19999.960000
1 -50000.000000
2 -9999.333333
3 -19999.040000
4 -20000.000000
5 -10000.000000
6 -49999.500000
You're almost there.

>>> df
    Name        Job  Salary
0   john    painter   40000
1  peter   engineer   50000
2    sam    plumber   30000
3   john     doctor  500000
4   john     driver   20000
5    sam  carpenter   10000
6  peter  scientist  100000
>>>
>>> result = df.assign(Salary=df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min())))
>>> # alternatively, df['Salary'] = df.groupby(... if you don't need a new frame
>>> result
    Name        Job    Salary
0   john    painter  0.041667
1  peter   engineer  0.000000
2    sam    plumber  1.000000
3   john     doctor  1.000000
4   john     driver  0.000000
5    sam  carpenter  0.000000
6  peter  scientist  1.000000

So basically, you just forgot to wrap x.max() - x.min() in parentheses.
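To see why the missing parentheses matter, here is a minimal sketch that rebuilds the frame from the question and evaluates both expressions on the john group. Division binds tighter than subtraction, so the unparenthesized form computes (x - min) / max - min. The variable names (`john`, `buggy`, `fixed`, `normalized`) are illustrative and not from the original post. The sketch also selects the 'Salary' column before calling transform. That is an assumption on my part rather than what the answer above does: it keeps the string column 'Job' out of the lambda, which recent pandas versions may otherwise reject with a TypeError.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["john", "peter", "sam", "john", "john", "sam", "peter"],
    "Job": ["painter", "engineer", "plumber", "doctor",
            "driver", "carpenter", "scientist"],
    "Salary": [40000, 50000, 30000, 500000, 20000, 10000, 100000],
})

# Salaries of one group, to inspect the arithmetic by hand.
john = df.loc[df["Name"] == "john", "Salary"]

# Without parentheses this is (x - min) / max - min.
buggy = (john - john.min()) / john.max() - john.min()

# With parentheses this is (x - min) / (max - min).
fixed = (john - john.min()) / (john.max() - john.min())

# Same fix applied per group, restricted to the numeric column
# (the column selection is an assumption, see the note above).
normalized = df.groupby("Name")["Salary"].transform(
    lambda x: (x - x.min()) / (x.max() - x.min())
)
result = df.assign(Salary=normalized)

print(buggy)
print(fixed)
print(result)
```

For row 0 the buggy expression evaluates to 20000 / 500000 - 20000 = -19999.96, which matches the first value in the output shown in the question.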
Note that this can be done faster with a series of vectorized operations.

>>> grouper = df.groupby('Name')['Salary']
>>> maxes = grouper.transform('max')
>>> mins = grouper.transform('min')
>>>
>>> result = df.assign(Salary=(df.Salary - mins)/(maxes - mins))
>>> result
    Name        Job    Salary
0   john    painter  0.041667
1  peter   engineer  0.000000
2    sam    plumber  1.000000
3   john     doctor  1.000000
4   john     driver  0.000000
5    sam  carpenter  0.000000
6  peter  scientist  1.000000

Timings:
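One case the vectorized version does not cover is a group whose maximum equals its minimum (a single-row group, for instance), where the division becomes 0/0 and yields NaN. The sketch below wraps the same steps in a helper and maps such groups to 0.0. The function name `minmax_by_group`, the extra "solo" row, and the choice of 0.0 as the fallback are all assumptions for illustration, not part of the original answer.

```python
import pandas as pd


def minmax_by_group(df, key, col):
    """Min-max scale df[col] within each group of df[key].

    Groups with max == min are mapped to 0.0 (an assumed convention).
    """
    grouped = df.groupby(key)[col]
    mins = grouped.transform("min")
    span = grouped.transform("max") - mins
    # Replace zero spans with NaN so those rows do not divide by zero.
    scaled = (df[col] - mins) / span.where(span != 0)
    # Rows from zero-span groups get the fallback value.
    return scaled.mask(span == 0, 0.0)


df = pd.DataFrame({
    "Name": ["john", "peter", "sam", "john", "john", "sam", "peter", "solo"],
    "Job": ["painter", "engineer", "plumber", "doctor",
            "driver", "carpenter", "scientist", "pilot"],
    "Salary": [40000, 50000, 30000, 500000, 20000, 10000, 100000, 70000],
})

result = df.assign(Salary=minmax_by_group(df, "Name", "Salary"))
print(result)
```

Whether 0.0, 1.0, or NaN is the right value for a constant group depends on what the normalized column is used for, so treat the fallback as a placeholder.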

>>> # Setup
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.concat([df]*1000, ignore_index=True)
>>> df.Name = np.arange(len(df)//4).repeat(4)  # 4 rows per group
>>> df
      Name        Job  Salary
0        0    painter   40000
1        0   engineer   50000
2        0    plumber   30000
3        0     doctor  500000
4        1     driver   20000
...    ...        ...     ...
6995  1748    plumber   30000
6996  1749     doctor  500000
6997  1749     driver   20000
6998  1749  carpenter   10000
6999  1749  scientist  100000

[7000 rows x 3 columns]
>>>
>>> # Tests @ i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min()))
1.19 s ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
...: grouper = df.groupby('Name')['Salary']
...: maxes = grouper.transform('max')
...: mins = grouper.transform('min')
...: (df.Salary - mins)/(maxes - mins)
...:
...:
3.04 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
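The comparison can also be reproduced outside IPython. The following is a rough sketch using the standard library's `timeit` module on the same 7000-row setup. The helper names `with_lambda` and `vectorized` are illustrative. Unlike the timing above, the lambda version here selects 'Salary' before transforming (an assumption made so the script runs on recent pandas versions), so its timing is not directly comparable to the 1.19 s figure. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "Name": ["john", "peter", "sam", "john", "john", "sam", "peter"],
    "Job": ["painter", "engineer", "plumber", "doctor",
            "driver", "carpenter", "scientist"],
    "Salary": [40000, 50000, 30000, 500000, 20000, 10000, 100000],
})

# Same setup as above: 7000 rows, 1750 groups of 4 rows each.
big = pd.concat([base] * 1000, ignore_index=True)
big["Name"] = np.arange(len(big) // 4).repeat(4)


def with_lambda(frame):
    # A Python-level lambda is called once per group.
    return frame.groupby("Name")["Salary"].transform(
        lambda x: (x - x.min()) / (x.max() - x.min())
    )


def vectorized(frame):
    # Two built-in group reductions, then plain column arithmetic.
    grouper = frame.groupby("Name")["Salary"]
    maxes = grouper.transform("max")
    mins = grouper.transform("min")
    return (frame["Salary"] - mins) / (maxes - mins)


runs = 3
t_lambda = timeit.timeit(lambda: with_lambda(big), number=runs) / runs
t_vectorized = timeit.timeit(lambda: vectorized(big), number=runs) / runs
print(f"lambda:     {t_lambda * 1e3:.2f} ms per call")
print(f"vectorized: {t_vectorized * 1e3:.2f} ms per call")
```

Both functions return the same values, so the choice between them is about speed and readability only.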