Cle*_*leb 5 python dataframe pandas
I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'key': list('AAABBCCAAC'),
'prop1': list('xyzuuyxzzz'),
'prop2': list('mnbnbbnnnn')
})
df2 = pd.DataFrame({
'key': list('ABBCAA'),
'prop1': [np.nan] * 6,
'prop2': [np.nan] * 6,
'keep_me': ['stuff'] * 6
})
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
6 C x n
7 A z n
8 A z n
9 C z n
key prop1 prop2 keep_me
0 A NaN NaN stuff
1 B NaN NaN stuff
2 B NaN NaN stuff
3 C NaN NaN stuff
4 A NaN NaN stuff
5 A NaN NaN stuff
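I now want to fill the columns prop1 and prop2 in df2 using the values from df1. For each key, df1 will have at least as many rows as df2 (in the example above: 5 times A vs 3 times A, 2 times B vs 2 times B, and 3 times C vs 1 time C). For each key, I want to fill df2 using the first n rows for that key in df1, where n is the number of rows the key has in df2.
So, my expected outcome for df2 would be: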
key prop1 prop2 keep_me
0 A x m stuff
1 B u n stuff
2 B u b stuff
3 C y b stuff
4 A y n stuff
5 A z b stuff
As key is not unique, I cannot simply build a dictionary and then use .map.
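To make that limitation concrete, here is a minimal sketch of what a dictionary lookup would do with these keys. The names lookup and keys are only illustrative and are not part of the original code; only key and prop1 are reproduced.

```python
import pandas as pd

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
})
keys = pd.Series(list('ABBCAA'))  # same values as the key column of df2

# A dict holds one value per key, so later rows overwrite earlier ones.
lookup = dict(zip(df1['key'], df1['prop1']))
print(lookup)                     # {'A': 'z', 'B': 'u', 'C': 'z'}

# Every occurrence of a key therefore receives the same (last) value.
print(keys.map(lookup).tolist())  # ['z', 'u', 'u', 'z', 'z', 'z']
```

Every A ends up with z, rather than x, y, z in order of appearance, which is why a per-key counter is needed.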
I was hoping that something along these lines would work:
pd.concat([df2.set_index('key'), df1.set_index('key')], axis=1, join='inner')
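but this fails with
ValueError: Shape of passed values is (5, 22), index implies (5, 10)
as, I guess, the index contains non-unique values.
How can I get the desired output?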
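Because key contains duplicate values, one possible solution is to create a new counter column g in both DataFrames with GroupBy.cumcount. The missing values in df2 can then be replaced with DataFrame.fillna, which aligns the two frames on a MultiIndex built from the key and g columns: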
df1['g'] = df1.groupby('key').cumcount()
df2['g'] = df2.groupby('key').cumcount()
print (df1)
key prop1 prop2 g
0 A x m 0
1 A y n 1
2 A z b 2
3 B u n 0
4 B u b 1
5 C y b 0
6 C x n 1
7 A z n 3
8 A z n 4
9 C z n 2
print (df2)
key prop1 prop2 keep_me g
0 A NaN NaN stuff 0
1 B NaN NaN stuff 0
2 B NaN NaN stuff 1
3 C NaN NaN stuff 0
4 A NaN NaN stuff 1
5 A NaN NaN stuff 2
df = (df2.set_index(['key','g'])
         .fillna(df1.set_index(['key','g']))
         .reset_index(level=1, drop=True)
         .reset_index())
print (df)
key prop1 prop2 keep_me
0 A x m stuff
1 B u n stuff
2 B u b stuff
3 C y b stuff
4 A y n stuff
5 A z b stuff
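The same (key, g) pairing can also be expressed as a left merge instead of fillna. This is an alternative sketch, not the approach from the answer above. It assumes the prop1 and prop2 columns in df2 are empty placeholders that can be dropped before merging, and the names df1_g, df2_g and merged are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
    'prop2': list('mnbnbbnnnn'),
})
df2 = pd.DataFrame({
    'key': list('ABBCAA'),
    'prop1': [np.nan] * 6,
    'prop2': [np.nan] * 6,
    'keep_me': ['stuff'] * 6,
})

# Number the rows within each key (0, 1, 2, ...) in order of appearance.
df1_g = df1.assign(g=df1.groupby('key').cumcount())
df2_g = df2.assign(g=df2.groupby('key').cumcount())

# Drop the empty placeholder columns, then pull prop1/prop2 from df1 for the
# matching (key, g) pair. A left merge keeps the row order of df2.
merged = (
    df2_g.drop(columns=['prop1', 'prop2'])
         .merge(df1_g, on=['key', 'g'], how='left')
         .drop(columns='g')
)
# Restore the original column order of df2.
merged = merged[['key', 'prop1', 'prop2', 'keep_me']]
print(merged)
```

Unlike the fillna version, this variant leaves df1 and df2 unmodified, because the counter column is added with assign rather than written into the original frames.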