cht*_*mon 5 python excel pandas
我想从具有以下三个级别的“合并”标头的Excel文件中产生整洁的数据:
Pandas可以很好地读取文件,并带有多级标题:
# df = pandas.read_excel('test.xlsx', header=[0,1,2])
Run Code Online (Sandbox Code Playgroud)
为了提高重复性,您可以复制粘贴以下内容:
df = pandas.DataFrame({('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'}, ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10}, ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20}, ('level1_1', 'level2_2', 'level3_1'): {1: 3, 2: 30}, ('level1_1', 'level2_2', 'level3_2'): {1: 4, 2: 40}, ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50}, ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60}, ('level1_2', 'level2_2', 'level3_1'): {1: 7, 2: 70}, ('level1_2', 'level2_2', 'level3_2'): {1: 8, 2: 80}})
Run Code Online (Sandbox Code Playgroud)
我想对此进行标准化,以便级别标题位于可变行中,但将列a,b和c保留为列:
如果没有多层标题,我将尽力pandas.melt(df, id_vars=['a', 'b', 'c'])获得想要的东西。pandas.melt(df)给了我想要的三个变量列,但显然并没有保留a,b和c列。
它应该像这样简单:
wide_df = pandas.read_excel(xlfile, sheetname, header=[0, 1, 2], index_col=[0, 1, 2, 3])
long_df = wide_df.stack().stack().stack()
Run Code Online (Sandbox Code Playgroud)
下面是一个 CSV 模型文件的示例(请注意第四行用于标记索引,第一列用于标记标题级别):
from io import StringIO
from textwrap import dedent
import pandas
mockcsv = StringIO(dedent("""\
num,,,this1,this1,this1,this1,that1,that1,that1,that1
let,,,thisA,thisA,thatA,thatA,thisB,thisB,thatB,thatB
animal,,,cat,dog,bird,lizard,cat,dog,bird,lizard
a,b,c,,,,,,,,
a1,b1,c1,x1,x2,x3,x4,x5,x6,x7,x8
a1,b1,c2,y1,y2,y3,y4,y5,y6,y7,y8
a1,b2,c1,z1,z2,z3,z4,z5,6z,zy,z8
"""))
wide_df = pandas.read_csv(mockcsv, index_col=[0, 1, 2], header=[0, 1, 2])
long_df = wide_df.stack().stack().stack()
Run Code Online (Sandbox Code Playgroud)
所以wide_df看起来像这样:
num this1 that1
let thisA thatA thisB thatB
animal cat dog bird lizard cat dog bird lizard
a b c
a1 b1 c1 x1 x2 x3 x4 x5 x6 x7 x8
c2 y1 y2 y3 y4 y5 y6 y7 y8
b2 c1 z1 z2 z3 z4 z5 6z zy z8
Run Code Online (Sandbox Code Playgroud)
和long_df
a b c animal let num
a1 b1 c1 bird thatA this1 x3
thatB that1 x7
cat thisA this1 x1
thisB that1 x5
dog thisA this1 x2
thisB that1 x6
lizard thatA this1 x4
thatB that1 x8
c2 bird thatA this1 y3
thatB that1 y7
cat thisA this1 y1
thisB that1 y5
dog thisA this1 y2
thisB that1 y6
lizard thatA this1 y4
thatB that1 y8
b2 c1 bird thatA this1 z3
thatB that1 zy
cat thisA this1 z1
thisB that1 z5
dog thisA this1 z2
thisB that1 6z
lizard thatA this1 z4
thatB that1 z8
Run Code Online (Sandbox Code Playgroud)
通过OP中显示的文字数据,您可以通过执行以下操作来获得此数据而无需修改任何内容:
index_names = ['a', 'b', 'c']
col_names = ['Level1', 'Level2', 'Level3']
df = (
pandas.read_excel('Book1.xlsx', header=[0, 1, 2], index_col=[0, 1, 2, 3])
.reset_index(level=0, drop=True)
.rename_axis(index_names, axis='index')
.rename_axis(col_names, axis='columns')
.stack()
.stack()
.stack()
.to_frame()
)
Run Code Online (Sandbox Code Playgroud)
我认为棘手的部分是检查每个文件以找出index_names应该是什么。
| 归档时间: |
|
| 查看次数: |
553 次 |
| 最近记录: |