Sha*_*ang 5 python dataframe pandas
这是我的pandas DataFrame:
import pandas as pd
import numpy as np
data = {"column1": [338, 519, 871, 1731, 2693, 2963, 3379, 3789, 3910, 4109, 4307, 4800, 4912, 5111, 5341, 5820, 6003, ...],
"column2": [NaN, 1, 1, 1, 1, NaN, NaN, 2, 2, NaN, NaN, 3, 3, 3, 3, 3, NaN, NaN], ...}
df = pd.DataFrame(data)
df
>>> column1 column2
0 338 NaN
1 519 1.0
2 871 1.0
3 1731 1.0
4 2693 1.0
5 2963 NaN
6 3379 NaN
7 3789 2.0
8 3910 2.0
9 4109 NaN
10 4307 NaN
11 4800 3.0
12 4912 3.0
13 5111 3.0
14 5341 3.0
15 5820 3.0
16 6003 NaN
17 .... ....
Run Code Online (Sandbox Code Playgroud)
整数column2表示"组" column1,例如行1-4是组"1",行7-8是组"2",行11-15是组"3"等.
我想计算每组中第一行和最后一行之间的差异.结果数据框如下所示:
df
>>> column1 column2 column3
0 338 NaN NaN
1 519 1.0 2174
2 871 1.0 2174
3 1731 1.0 2174
4 2693 1.0 2174
5 2963 NaN NaN
6 3379 NaN NaN
7 3789 2.0 121
8 3910 2.0 121
9 4109 NaN NaN
10 4307 NaN NaN
11 4800 3.0 1020
12 4912 3.0 1020
13 5111 3.0 1020
14 5341 3.0 1020
15 5820 3.0 1020
16 6003 NaN NaN
17 .... .... ...
Run Code Online (Sandbox Code Playgroud)
因为:
2693-519 = 2174
3910-3789 = 121
5820-4800 = 1020
Run Code Online (Sandbox Code Playgroud)
什么是"熊猫方式"来计算column3?不知何故,必须迭代column3,寻找连续的值组df.column2 != "NaN".
编辑:我意识到我的例子可能会引导读者假设价值column1只会增加.实际上,有间隔,列intervals
df = pd.DataFrame(data)
df
>>> interval column1 column2
0 interval1 338 NaN
1 interval1 519 1.0
2 interval1 871 1.0
3 interval1 1731 1.0
4 interval1 2693 1.0
5 interval1 2963 NaN
6 interval1 3379 NaN
7 interval1 3789 2.0
8 interval1 3910 2.0
9 interval1 4109 NaN
10 interval1 4307 NaN
11 interval1 4800 3.0
12 interval1 4912 3.0
13 interval1 5111 3.0
14 interval1 5341 3.0
15 interval1 5820 3.0
16 interval1 6003 NaN
17 .... ....
18 interval2 12 13
19 interval2 115 13
20 interval2 275 NaN
....
Run Code Online (Sandbox Code Playgroud)
您可以先过滤,然后获取第一个和最后一个值的差异transform:
df['col3'] = df[df.column2.notnull()]
.groupby('column2')['column1']
.transform(lambda x: x.iat[-1] - x.iat[0])
print (df)
column1 column2 col3
0 338 NaN NaN
1 519 1.0 2174.0
2 871 1.0 2174.0
3 1731 1.0 2174.0
4 2693 1.0 2174.0
5 2963 NaN NaN
6 3379 NaN NaN
7 3789 2.0 121.0
8 3910 2.0 121.0
9 4109 NaN NaN
10 4307 NaN NaN
11 4800 3.0 1020.0
12 4912 3.0 1020.0
13 5111 3.0 1020.0
14 5341 3.0 1020.0
15 5820 3.0 1020.0
16 6003 NaN NaN
Run Code Online (Sandbox Code Playgroud)
按您的新数据编辑1:
df['col3'] = df[df.column2.notnull()]
.groupby('column2')['column1']
.transform(lambda x: x.iat[-1] - x.iat[0])
print (df)
interval column1 column2 col3
0 interval1 338 NaN NaN
1 interval1 519 1.0 2174.0
2 interval1 871 1.0 2174.0
3 interval1 1731 1.0 2174.0
4 interval1 2693 1.0 2174.0
5 interval1 2963 NaN NaN
6 interval1 3379 NaN NaN
7 interval1 3789 2.0 121.0
8 interval1 3910 2.0 121.0
9 interval1 4109 NaN NaN
10 interval1 4307 NaN NaN
11 interval1 4800 3.0 1020.0
12 interval1 4912 3.0 1020.0
13 interval1 5111 3.0 1020.0
14 interval1 5341 3.0 1020.0
15 interval1 5820 3.0 1020.0
16 interval1 6003 NaN NaN
18 interval2 12 13.0 103.0
19 interval2 115 13.0 103.0
20 interval2 275 NaN NaN
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
1576 次 |
| 最近记录: |