Creating a multi-index pandas DataFrame from a delimited string column

Question

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({
    'col1': ['a, b'],
    'col2': [100]
}, index=['A'])

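What I want to achieve is to "explode" col1 into a multi-level index whose second level holds the values from col1, while keeping the values of col2 and the original index, e.g.: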

idx_1,idx_2,val
A,a,100
A,b,100

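I'm sure I need a col1.str.split(', ') somewhere, but I'm completely lost as to how to create the desired result. Maybe I need a pivot_table, but I can't see how that would give me the required index.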

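I've spent a good hour and a half going through the docs on reshaping and pivoting... I'm sure it's straightforward; I just don't know the terminology needed to find the "right thing".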

Answer 1

J R*_*ape 6

Adapting the first answer here, this is one way to do it. You may want to play around with the names to get the ones you want.

If your eventual aim is to do this for very large dataframes, there may be more efficient ways to do it.

import pandas as pd

# Create test dataframe
df = pd.DataFrame({'col1': ['a, b'], 'col2': [100]}, index=['A'])

# Split the values in col1 into separate columns, then stack them into one long Series
s = df['col1'].str.split(', ', expand=True).stack()

# Drop the last level from the *index* of this stack
# (it only holds the positional numbers 0, 1, ... of the split pieces)
s.index = s.index.droplevel(-1)

# Just give it a name - I've picked yours from the OP
s.name = 'idx_2'

# Replace col1 with the stacked values; the join repeats col2 for each piece
df = df.drop(columns='col1').join(s)
# At this point you're more or less there

# If you truly want 'idx_2' as part of the index - do this
indexed_df = df.set_index('idx_2', append=True)

With the original dataframe as input, the code gives this as output:

>>> indexed_df
         col2
  idx_2
A a       100
  b       100
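
Not part of the original answer: on pandas 0.25 or later, `DataFrame.explode` may shorten the split-and-stack steps above. This is a minimal sketch under that version assumption, and the name `indexed_alt` is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a, b'], 'col2': [100]}, index=['A'])

# Turn each delimited string into a list, then give every list element its own row.
# explode repeats the index label and the other columns for each element.
exploded = df.assign(col1=df['col1'].str.split(', ')).explode('col1')

# Rename the exploded column and append it to the existing index.
indexed_alt = (
    exploded.rename(columns={'col1': 'idx_2'})
            .set_index('idx_2', append=True)
)
print(indexed_alt)
```

Whether this is actually faster on very large frames is something to measure rather than assume.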

Further manipulation

If you want to give the index levels some meaningful names, you can use

indexed_df.index.names = ['idx_1','idx_2']

which gives the output

             col2
idx_1 idx_2
A     a       100
      b       100
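
Assigning to `index.names` changes `indexed_df` itself. If a non-mutating variant is preferred, `rename_axis` should give the same labels on a new object. A small sketch, with the frame rebuilt by hand so the snippet runs on its own (`named_df` is an illustrative name, not from the answer):

```python
import pandas as pd

# Hand-built copy of the frame produced by the answer's code.
indexed_df = pd.DataFrame(
    {'col2': [100, 100]},
    index=pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b')], names=[None, 'idx_2']),
)

# rename_axis returns a new frame and leaves indexed_df untouched.
named_df = indexed_df.rename_axis(index=['idx_1', 'idx_2'])
print(named_df.index.names)
```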

If you really want the index flattened back into columns, use this

indexed_df.reset_index(inplace=True)

which gives the output

>>> indexed_df
    idx_1 idx_2  col2
0       A     a   100
1       A     b   100
>>>
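
To get from this flattened frame to the exact `idx_1,idx_2,val` layout shown in the question, one possible follow-up is to rename `col2` and serialise without the row numbers. A sketch, assuming the named two-level frame from the previous step (rebuilt by hand here; `flat` and `csv_text` are illustrative names):

```python
import pandas as pd

# Hand-built copy of the frame with both index levels named.
indexed_df = pd.DataFrame(
    {'col2': [100, 100]},
    index=pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b')], names=['idx_1', 'idx_2']),
)

# reset_index() without inplace=True returns a new frame instead of modifying indexed_df.
# 'val' is the column name used in the question's target layout.
flat = indexed_df.reset_index().rename(columns={'col2': 'val'})

# index=False leaves out the 0, 1, ... row labels.
csv_text = flat.to_csv(index=False)
print(csv_text)
```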

More complex input

If you try a slightly more interesting example input, e.g.

>>> df = pd.DataFrame({
...     'col1': ['a, b', 'c, d'],
...     'col2': [100,50]
... }, index = ['A','B'])

you get:

>>> indexed_df
         col2
  idx_2
A a       100
  b       100
B c        50
  d        50
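
Since the same steps are rerun for every new input, they can be wrapped in a function. The helper below is a hypothetical convenience wrapper (`split_to_index_level` is not a pandas function), sketched for inputs like the one above; rows that split into differing numbers of pieces are not covered here:

```python
import pandas as pd

def split_to_index_level(df, column, level_name, sep=', '):
    """Split `column` on `sep` and append the pieces to the index as a new level."""
    # One column per piece, then stacked into a single long Series.
    pieces = df[column].str.split(sep, expand=True).stack()
    # Drop the positional level that stack() added to the index.
    pieces.index = pieces.index.droplevel(-1)
    pieces.name = level_name
    # The join repeats the remaining columns once per piece.
    return df.drop(columns=column).join(pieces).set_index(level_name, append=True)

df = pd.DataFrame({'col1': ['a, b', 'c, d'], 'col2': [100, 50]}, index=['A', 'B'])
result = split_to_index_level(df, 'col1', 'idx_2')
print(result)
```

The function returns a new frame and leaves the input `df` unchanged, which makes it easier to try on several inputs in one session.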
