Ank*_*Ank 7 python data-analysis dataframe pandas
I have a dataframe like the one below, where the first column contains dates and the other columns contain data for those dates:
date k1-v1 k1-v2 k2-v1 k2-v2 k1k3-v1 k1k3-v2 k4-v1 k4-v2
0 2021-01-05 2.0 7.0 NaN NaN NaN NaN 9.0 6.0
1 2021-01-31 NaN NaN 8.0 5.0 NaN NaN 7.0 6.0
2 2021-02-15 9.0 5.0 NaN 3.0 4.0 NaN NaN NaN
3 2021-02-28 NaN 9.0 0.0 1.0 NaN NaN 8.0 8.0
4 2021-03-20 7.0 NaN NaN NaN NaN NaN NaN NaN
5 2021-03-31 NaN NaN 8.0 NaN 3.0 NaN 8.0 0.0
6 2021-04-10 NaN NaN 7.0 6.0 NaN NaN NaN 9.0
7 2021-04-30 NaN 6.0 NaN NaN NaN NaN 1.0 NaN
8 2021-05-14 8.0 NaN 3.0 3.0 4.0 NaN NaN NaN
9 2021-05-31 NaN NaN 2.0 1.0 NaN NaN NaN NaN
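The columns always come in pairs: (k1-v1, k1-v2); (k2-v1, k2-v2); (k1k3-v1, k1k3-v2); and so on for N pairs. However, the paired columns are not always in this order. So k1-v1 is not necessarily followed directly by k1-v2, but there will be a k1-v2 column somewhere in the dataframe. For simplicity, I have shown them side by side.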
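Because the pairs may be scattered across the dataframe, it can help to group the column names by key first. The following is a minimal sketch, assuming every data column is named `<key>-v1` or `<key>-v2`; the shuffled `columns` list and the `pairs` and `incomplete` names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical column order in which the pairs are not adjacent.
columns = ["k1-v1", "k2-v2", "k1k3-v1", "k1-v2", "k4-v2", "k2-v1", "k4-v1", "k1k3-v2"]

# Map each key to its columns: {"k1": {"v1": "k1-v1", "v2": "k1-v2"}, ...}
pairs = defaultdict(dict)
for name in columns:
    key, _, val = name.rpartition("-")  # split at the last hyphen
    pairs[key][val] = name

# Check the "always in pairs" assumption before relying on it.
incomplete = [key for key, cols in pairs.items() if set(cols) != {"v1", "v2"}]
print(dict(pairs))
print("keys without a full pair:", incomplete)
```

Looking columns up by name this way does not depend on where they sit in the dataframe.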
I need to find the date of the last valid data in each column of every pair and summarize it as follows:
keys v1-last v2-last
0 k1 2021-05-14 2021-04-30
1 k2 2021-05-31 2021-05-31
2 k1k3 2021-05-14 NaN
3 k4 2021-04-30 2021-04-10
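So for (k1-v1) the last valid value is 8.0 on 2021-05-14, and for (k1-v2) it is 6.0 on 2021-04-30. The v1-last and v2-last columns for k1 in the dataframe above are then filled in accordingly, and likewise for the other keys.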
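To make the k1 example concrete, here is a small sketch that looks up the last valid date for the two k1 columns only, using the values from the table above. The frame name `k1` is just for illustration, and the dates are kept as plain strings.

```python
import numpy as np
import pandas as pd

dates = ["2021-01-05", "2021-01-31", "2021-02-15", "2021-02-28", "2021-03-20",
         "2021-03-31", "2021-04-10", "2021-04-30", "2021-05-14", "2021-05-31"]
nan = np.nan

# The two k1 columns from the sample data, indexed by date.
k1 = pd.DataFrame(
    {
        "k1-v1": [2.0, nan, 9.0, nan, 7.0, nan, nan, nan, 8.0, nan],
        "k1-v2": [7.0, nan, 5.0, 9.0, nan, nan, nan, 6.0, nan, nan],
    },
    index=dates,
)

for name in k1.columns:
    # last_valid_index() returns the index label of the last non-NaN entry.
    last_date = k1[name].last_valid_index()
    print(name, last_date, k1.loc[last_date, name])
```

For these two columns the loop reports 2021-05-14 (value 8.0) and 2021-04-30 (value 6.0), matching the k1 row of the summary.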
Currently I am doing it like this, which is not very efficient on larger datasets:
import pandas as pd

# Use the dates as the index so that last_valid_index() returns a date
df.set_index('date', inplace=True)
# Key prefixes such as 'k1', 'k2', 'k1k3', 'k4'
unique_cols = set([col[0] for col in df.columns.str.split('-')])
summarized_data = []
for col in unique_cols:
    # Select the pair by name, so the column order does not matter
    pair_df = df.loc[:,[col+'-v1',col+'-v2']].dropna(how='all')
    v1_last_valid = pair_df.iloc[:,0].last_valid_index()
    v2_last_valid = pair_df.iloc[:,1].last_valid_index()
    summarized_data.append([col, v1_last_valid, v2_last_valid])
summarized_df = pd.DataFrame(summarized_data, columns=['keys','v1-last','v2-last'])
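This works and gives me the expected result, but it takes a lot of time on large datasets. Can the loop be avoided, and can this be done in a different, more efficient way?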
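# Long format: one row per (date, column); dropna() guards against pandas versions whose stack() keeps NaN
s = df.set_index('date').stack().dropna()
s = s.reset_index().drop_duplicates('level_1', keep='last')
s[['keys', 'val']] = s['level_1'].str.split('-', expand=True)
# pivot() arguments are passed by keyword; recent pandas versions no longer accept them positionally
s = s.pivot(index='keys', columns='val', values='date').add_suffix('-last')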
Set the index of the dataframe to date and use stack to reshape it:
date
2021-01-05 k1-v1 2.0
k1-v2 7.0
k4-v1 9.0
k4-v2 6.0
2021-01-31 k2-v1 8.0
k2-v2 5.0
k4-v1 7.0
k4-v2 6.0
...
2021-05-31 k2-v1 2.0
k2-v2 1.0
dtype: float64
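The stacking step relies on missing values not appearing in the long format. As far as I know, whether `stack` drops NaN on its own depends on the pandas version, so the sketch below adds an explicit `dropna()` as a precaution. The two-row frame `small` is a hypothetical cut-down version of the sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row, two-column excerpt of the sample data.
small = pd.DataFrame(
    {
        "date": ["2021-01-05", "2021-01-31"],
        "k1-v1": [2.0, np.nan],
        "k2-v1": [np.nan, 8.0],
    }
)

# One entry per (date, column) pair; the explicit dropna() removes any NaN
# entries regardless of how this pandas version's stack() treats them.
long = small.set_index("date").stack().dropna()
print(long)
```

Only the two non-missing cells survive, each labelled with its date and its original column name.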
Reset the index and drop the rows with duplicate values in level_1, keeping only the last occurrence of each:
date level_1 0
24 2021-04-10 k4-v2 9.0
25 2021-04-30 k1-v2 6.0
26 2021-04-30 k4-v1 1.0
27 2021-05-14 k1-v1 8.0
30 2021-05-14 k1k3-v1 4.0
31 2021-05-31 k2-v1 2.0
32 2021-05-31 k2-v2 1.0
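Keeping the last duplicate gives the latest date only if the rows are already in ascending date order, as they are in the sample data. If that is not guaranteed, sorting first is a cheap safeguard. A minimal sketch, assuming the dates are ISO-formatted strings (which sort chronologically) and using a made-up, deliberately unsorted frame:

```python
import pandas as pd

# Hypothetical long-format rows, deliberately out of date order.
long = pd.DataFrame(
    {
        "date": ["2021-05-14", "2021-01-05", "2021-04-30", "2021-02-15"],
        "level_1": ["k1-v1", "k1-v1", "k1-v2", "k1-v2"],
        "value": [8.0, 2.0, 6.0, 5.0],
    }
)

# Sort by date so that keep="last" really selects the most recent row per column.
last = long.sort_values("date").drop_duplicates("level_1", keep="last")
print(last)
```

Without the `sort_values` call, this input would yield 2021-01-05 for k1-v1 and 2021-02-15 for k1-v2, which are simply the rows that happen to come last.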
Split the strings in column level_1 to create two additional columns, keys and val:
date level_1 0 keys val
24 2021-04-10 k4-v2 9.0 k4 v2
25 2021-04-30 k1-v2 6.0 k1 v2
26 2021-04-30 k4-v1 1.0 k4 v1
27 2021-05-14 k1-v1 8.0 k1 v1
30 2021-05-14 k1k3-v1 4.0 k1k3 v1
31 2021-05-31 k2-v1 2.0 k2 v1
32 2021-05-31 k2-v2 1.0 k2 v2
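Splitting on every hyphen works for the sample names, but it would produce more than two parts if a key itself contained a hyphen. If that can happen in your data, splitting once from the right is one option. The labels below, including `k-5-v1`, are hypothetical and only illustrate the difference.

```python
import pandas as pd

# Hypothetical column labels; "k-5-v1" has a hyphen inside its key.
labels = pd.Series(["k1-v1", "k1k3-v2", "k-5-v1"])

# Split only at the last hyphen, so the result always has exactly two parts.
parts = labels.str.rsplit("-", n=1, expand=True)
parts.columns = ["keys", "val"]
print(parts)
```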
Pivot to reshape the dataframe and add the suffix -last to the column names:
val v1-last v2-last
keys
k1 2021-05-14 2021-04-30
k1k3 2021-05-14 NaN
k2 2021-05-31 2021-05-31
k4 2021-04-30 2021-04-10
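For anyone who wants to run the whole thing, here is a self-contained sketch that chains the steps above on the sample data from the question. The function name `last_valid_dates` is made up, and the explicit `dropna()`, `sort_values("date")` and right-split are defensive additions of mine, not part of the steps shown above; dates are kept as ISO strings.

```python
import io

import pandas as pd

SAMPLE = """\
date k1-v1 k1-v2 k2-v1 k2-v2 k1k3-v1 k1k3-v2 k4-v1 k4-v2
2021-01-05 2.0 7.0 NaN NaN NaN NaN 9.0 6.0
2021-01-31 NaN NaN 8.0 5.0 NaN NaN 7.0 6.0
2021-02-15 9.0 5.0 NaN 3.0 4.0 NaN NaN NaN
2021-02-28 NaN 9.0 0.0 1.0 NaN NaN 8.0 8.0
2021-03-20 7.0 NaN NaN NaN NaN NaN NaN NaN
2021-03-31 NaN NaN 8.0 NaN 3.0 NaN 8.0 0.0
2021-04-10 NaN NaN 7.0 6.0 NaN NaN NaN 9.0
2021-04-30 NaN 6.0 NaN NaN NaN NaN 1.0 NaN
2021-05-14 8.0 NaN 3.0 3.0 4.0 NaN NaN NaN
2021-05-31 NaN NaN 2.0 1.0 NaN NaN NaN NaN
"""


def last_valid_dates(df):
    """Return one row per key with the last valid date of its v1 and v2 columns."""
    # Long format: one row per non-missing (date, column) cell.
    long = df.set_index("date").stack().dropna().reset_index()
    long.columns = ["date", "column", "value"]
    # Keep the most recent row for each original column.
    long = long.sort_values("date").drop_duplicates("column", keep="last")
    # Separate "k1-v1" into key "k1" and value name "v1".
    long[["keys", "val"]] = long["column"].str.rsplit("-", n=1, expand=True)
    # Wide format: keys as rows, v1/v2 as columns, dates as values.
    wide = long.pivot(index="keys", columns="val", values="date")
    return wide.add_suffix("-last")


df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")
summary = last_valid_dates(df)
print(summary)
```

A column with no valid data at all, such as k1k3-v2, never reaches the pivot, so its cell is left missing in the result.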