Is there a simple way to expand/complete a Pandas DataFrame to include missing observations across multiple columns?

Question


I have a DataFrame that looks like this:

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'category1': list('AABAAB'),
...     'category2': list('xyxxyx'),
...     'year': [2000, 2000, 2000, 2002, 2002, 2002],
...     'value': [0, 1, 0, 4, 3, 4]
... })

>>> df
  category1 category2  year  value
0         A         x  2000      0
1         A         y  2000      1
2         B         x  2000      0
3         A         x  2002      4
4         A         y  2002      3
5         B         x  2002      4


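I want to expand the data to include the missing years within a given range. For example, if the range is range(2000, 2003), the expanded DataFrame should look like this: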

  category1 category2  year  value
0         A         x  2000    0.0
1         A         y  2000    1.0
2         B         x  2000    0.0
3         A         x  2001    NaN
4         A         y  2001    NaN
5         B         x  2001    NaN
6         A         x  2002    4.0
7         A         y  2002    3.0
8         B         x  2002    4.0


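I tried an approach using pd.MultiIndex.from_product, but it creates rows for combinations of category1 and category2 that are not valid (for example, B and y should never appear together). Using from_product and then filtering is too slow for my actual data, which contains many more combinations.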
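To make the problem concrete, here is a minimal sketch of what a from_product-based expansion might look like on the sample data. The exact code that was tried is not shown in the question, so the variable names and the reindex step are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'category1': list('AABAAB'),
    'category2': list('xyxxyx'),
    'year': [2000, 2000, 2000, 2002, 2002, 2002],
    'value': [0, 1, 0, 4, 3, 4],
})

# Full Cartesian product of every observed level with the target years.
full_index = pd.MultiIndex.from_product(
    [df['category1'].unique(), df['category2'].unique(), range(2000, 2003)],
    names=['category1', 'category2', 'year'],
)

expanded = (df.set_index(['category1', 'category2', 'year'])
              .reindex(full_index)
              .reset_index())

# 2 * 2 * 3 = 12 rows, including the (B, y) pair that never occurs in df.
invalid = expanded[(expanded['category1'] == 'B')
                   & (expanded['category2'] == 'y')]
print(len(expanded), len(invalid))
```

The unwanted (B, y) rows then have to be filtered out afterwards, which is the extra step that becomes costly when the number of category levels grows.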

Is there a simpler solution that scales well?

Edit

Here is the solution I ended up using, in an attempt to generalize the problem a little:

id_cols = ['category1', 'category2']

df_out = (df.pivot_table(index=id_cols, values='value', columns='year')
            .reindex(columns=range(2000, 2003))
            .stack(dropna=False)
            .sort_index(level=-1)
            .reset_index(name='value'))

  category1 category2  year  value
0         A         x  2000    0.0
1         A         y  2000    1.0
2         B         x  2000    0.0
3         A         x  2001    NaN
4         A         y  2001    NaN
5         B         x  2001    NaN
6         A         x  2002    4.0
7         A         y  2002    3.0
8         B         x  2002    4.0


Answer 1

WeN*_*Ben 8

Let's use stack and unstack:

dfout = (df.set_index(['year', 'category1', 'category2'])
           .value.unstack(level=0)
           .reindex(columns=range(2000, 2003))
           .stack(dropna=False)
           .to_frame('value')
           .sort_index(level=2)
           .reset_index())

  category1 category2  year  value
0         A         x  2000    0.0
1         A         y  2000    1.0
2         B         x  2000    0.0
3         A         x  2001    NaN
4         A         y  2001    NaN
5         B         x  2001    NaN
6         A         x  2002    4.0
7         A         y  2002    3.0
8         B         x  2002    4.0
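As a possible alternative that avoids reshaping, the same result can be sketched with a cross join between the category pairs that actually occur and the target years, followed by a left merge. This is not part of the answer above; the names pairs, years, and grid are hypothetical, and how='cross' assumes pandas 1.2 or later.

```python
import pandas as pd

df = pd.DataFrame({
    'category1': list('AABAAB'),
    'category2': list('xyxxyx'),
    'year': [2000, 2000, 2000, 2002, 2002, 2002],
    'value': [0, 1, 0, 4, 3, 4],
})

id_cols = ['category1', 'category2']

# Only the category pairs that actually occur in the data (no B/y).
pairs = df[id_cols].drop_duplicates()

# The years every pair should be observed in.
years = pd.DataFrame({'year': range(2000, 2003)})

# Every valid pair combined with every year (needs pandas >= 1.2).
grid = pairs.merge(years, how='cross')

# Attach the known values; years without an observation get NaN.
df_out = (grid.merge(df, on=id_cols + ['year'], how='left')
              .sort_values(['year'] + id_cols)
              .reset_index(drop=True))
print(df_out)
```

Because the grid is built from observed pairs only, no filtering step is needed. Whether this is faster than the stack/unstack approach on large data has not been measured here.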


Could we replace `df.set_index(['year','category1','category2']).value.unstack(level=0)` with `df.pivot_table(index=['category1','category2'], values=['value'], columns=['year'])`? (3 upvotes)
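A quick sketch comparing the two expressions from this comment on the sample data may help. It assumes each (category1, category2, year) combination appears at most once; pivot_table aggregates duplicates (mean by default), whereas unstack raises an error on duplicate index entries.

```python
import pandas as pd

df = pd.DataFrame({
    'category1': list('AABAAB'),
    'category2': list('xyxxyx'),
    'year': [2000, 2000, 2000, 2002, 2002, 2002],
    'value': [0, 1, 0, 4, 3, 4],
})

# Wide table via set_index + unstack: no aggregation is performed.
wide_unstack = (df.set_index(['year', 'category1', 'category2'])
                  .value.unstack(level=0))

# Wide table via pivot_table: values are aggregated with the mean,
# so integer input comes back as floats.
wide_pivot = df.pivot_table(index=['category1', 'category2'],
                            values='value', columns='year')

# Passing values as a list keeps an extra 'value' level on the columns.
wide_pivot_list = df.pivot_table(index=['category1', 'category2'],
                                 values=['value'], columns=['year'])

print(wide_unstack)
print(wide_pivot)
print(wide_pivot_list.columns.nlevels)
```

On this data the first two tables hold the same numbers. With values=['value'] the extra column level would need to be dropped or handled before reindexing the columns by year.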

Archived:	5 years, 9 months ago
Views:	173
Last activity:	5 years, 1 month ago