通过熊猫从多层Excel文件中整理数据

cht*_*mon 5 python excel pandas

我想从具有以下三个级别的“合并”标头的Excel文件中产生整洁的数据:

在此处输入图片说明

Pandas可以很好地读取文件,并带有多级标题:

# df = pandas.read_excel('test.xlsx', header=[0,1,2])
Run Code Online (Sandbox Code Playgroud)

为了提高重复性,您可以复制粘贴以下内容:

df = pandas.DataFrame({('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'}, ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10}, ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20}, ('level1_1', 'level2_2', 'level3_1'): {1: 3, 2: 30}, ('level1_1', 'level2_2', 'level3_2'): {1: 4, 2: 40}, ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50}, ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60}, ('level1_2', 'level2_2', 'level3_1'): {1: 7, 2: 70}, ('level1_2', 'level2_2', 'level3_2'): {1: 8, 2: 80}})
Run Code Online (Sandbox Code Playgroud)

我想对此进行标准化,以便级别标题位于可变行中,但将列a,b和c保留为列:

所需的输出

如果没有多层标题,我将尽力pandas.melt(df, id_vars=['a', 'b', 'c'])获得想要的东西。pandas.melt(df)给了我想要的三个变量列,但显然并没有保留a,b和c列。

Pau*_*l H 4

它应该像这样简单:

wide_df = pandas.read_excel(xlfile, sheetname, header=[0, 1, 2], index_col=[0, 1, 2, 3])

long_df = wide_df.stack().stack().stack()
Run Code Online (Sandbox Code Playgroud)

下面是一个 CSV 模型文件的示例(请注意第四行用于标记索引,第一列用于标记标题级别):

from io import StringIO
from textwrap import dedent

import pandas

mockcsv = StringIO(dedent("""\
    num,,,this1,this1,this1,this1,that1,that1,that1,that1
    let,,,thisA,thisA,thatA,thatA,thisB,thisB,thatB,thatB
    animal,,,cat,dog,bird,lizard,cat,dog,bird,lizard
    a,b,c,,,,,,,,
    a1,b1,c1,x1,x2,x3,x4,x5,x6,x7,x8
    a1,b1,c2,y1,y2,y3,y4,y5,y6,y7,y8
    a1,b2,c1,z1,z2,z3,z4,z5,6z,zy,z8
"""))


wide_df = pandas.read_csv(mockcsv, index_col=[0, 1, 2], header=[0, 1, 2])
long_df = wide_df.stack().stack().stack()
Run Code Online (Sandbox Code Playgroud)

所以wide_df看起来像这样:

num      this1                  that1                 
let      thisA     thatA        thisB     thatB       
animal     cat dog  bird lizard   cat dog  bird lizard
a  b  c                                               
a1 b1 c1    x1  x2    x3     x4    x5  x6    x7     x8
      c2    y1  y2    y3     y4    y5  y6    y7     y8
   b2 c1    z1  z2    z3     z4    z5  6z    zy     z8
Run Code Online (Sandbox Code Playgroud)

long_df

a   b   c   animal  let    num  
a1  b1  c1  bird    thatA  this1    x3
                    thatB  that1    x7
            cat     thisA  this1    x1
                    thisB  that1    x5
            dog     thisA  this1    x2
                    thisB  that1    x6
            lizard  thatA  this1    x4
                    thatB  that1    x8
        c2  bird    thatA  this1    y3
                    thatB  that1    y7
            cat     thisA  this1    y1
                    thisB  that1    y5
            dog     thisA  this1    y2
                    thisB  that1    y6
            lizard  thatA  this1    y4
                    thatB  that1    y8
    b2  c1  bird    thatA  this1    z3
                    thatB  that1    zy
            cat     thisA  this1    z1
                    thisB  that1    z5
            dog     thisA  this1    z2
                    thisB  that1    6z
            lizard  thatA  this1    z4
                    thatB  that1    z8
Run Code Online (Sandbox Code Playgroud)

通过OP中显示的文字数据,您可以通过执行以下操作来获得此数据而无需修改任何内容:

index_names = ['a', 'b', 'c']
col_names = ['Level1', 'Level2', 'Level3']
df = (
    pandas.read_excel('Book1.xlsx', header=[0, 1, 2], index_col=[0, 1, 2, 3])
        .reset_index(level=0, drop=True)
        .rename_axis(index_names, axis='index')
        .rename_axis(col_names, axis='columns')
        .stack()
        .stack()
        .stack()
        .to_frame()
)
Run Code Online (Sandbox Code Playgroud)

我认为棘手的部分是检查每个文件以找出index_names应该是什么。