How to filter by a sub-level index in Pandas

big*_*bug 4 python pandas

I have a 'df' with a MultiIndex (STK_ID, RPT_Date):

                       sales         cogs     net_pft
STK_ID RPT_Date                                      
000876 20060331          NaN          NaN         NaN
       20060630    857483000    729541000    67157200
       20060930   1063590000    925140000    50807000
       20061231    853960000    737660000    51574000
       20070331  -2695245000  -2305078000  -167642500
       20070630   1146245000   1050808000   113468500
       20070930   1327970000   1204800000    84337000
       20071231   1439140000   1331870000    53398000
       20080331  -3135240000  -2798090000  -248054300
       20080630   1932470000   1777010000   133756300
       20080930   1873240000   1733660000    92099000
002254 20061231 -16169620000 -15332705000  -508333200
       20070331   -763844000   -703460000    -1538000
       20070630    501221000    289167000   118012200
       20070930    460483000    274026000    95967000

How can I write a command that keeps only the rows whose 'RPT_Date' contains '0630' (these are the Q2 reports)? The result should be:

                       sales         cogs     net_pft
STK_ID RPT_Date                                      
000876 20060630    857483000    729541000    67157200
       20070630   1146245000   1050808000   113468500
       20080630   1932470000   1777010000   133756300
002254 20070630    501221000    289167000   118012200

I tried df[df['RPT_Date'].str.contains('0630')], but Pandas rejects it because 'RPT_Date' is not a column but a sub-level of the index.

Thanks for any hints...

Gar*_*ett 15

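To use the "str.*" methods, which work on columns, you can reset the index, filter the rows with a "str.*" call on the column, and then re-create the index.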

In [72]: x = df.reset_index(); x[x.RPT_Date.str.endswith("0630")].set_index(['STK_ID', 'RPT_Date'])
Out[72]: 
                      sales        cogs    net_pft
STK_ID RPT_Date                                   
000876 20060630   857483000   729541000   67157200
       20070630  1146245000  1050808000  113468500
       20080630  1932470000  1777010000  133756300
002254 20070630   501221000   289167000  118012200

However, this approach is not particularly fast.

In [73]: timeit x = df.reset_index(); x[x.RPT_Date.str.endswith("0630")].set_index(['STK_ID', 'RPT_Date'])
1000 loops, best of 3: 1.78 ms per loop

Another approach builds on the fact that a MultiIndex object behaves much like a list of tuples.

In [75]: df.index
Out[75]: 
MultiIndex
[('000876', '20060331') ('000876', '20060630') ('000876', '20060930')
 ('000876', '20061231') ('000876', '20070331') ('000876', '20070630')
 ('000876', '20070930') ('000876', '20071231') ('000876', '20080331')
 ('000876', '20080630') ('000876', '20080930') ('002254', '20061231')
 ('002254', '20070331') ('002254', '20070630') ('002254', '20070930')]

Building on this, you can use df.index.map() to create a boolean array from the MultiIndex and use the result to filter the frame.

In [76]: df[df.index.map(lambda x: x[1].endswith("0630"))]
Out[76]: 
                      sales        cogs    net_pft
STK_ID RPT_Date                                   
000876 20060630   857483000   729541000   67157200
       20070630  1146245000  1050808000  113468500
       20080630  1932470000  1777010000  133756300
002254 20070630   501221000   289167000  118012200

It is also much faster.

In [77]: timeit df[df.index.map(lambda x: x[1].endswith("0630"))]
1000 loops, best of 3: 240 us per loop
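For anyone who wants to try both approaches without the original data, here is a minimal self-contained sketch. It rebuilds only a few rows of the frame from the question and assumes that the RPT_Date level holds strings (the `.str` accessor and `endswith` both depend on that). The names `flat`, `via_reset`, `mask` and `via_tuples` are illustrative, and a list comprehension stands in for the `df.index.map()` call as an equivalent way of walking the index tuples.

```python
import pandas as pd

# A small subset of the frame from the question.
# Assumption: both index levels are stored as strings.
index = pd.MultiIndex.from_tuples(
    [
        ("000876", "20060331"),
        ("000876", "20060630"),
        ("000876", "20070630"),
        ("002254", "20061231"),
        ("002254", "20070630"),
    ],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame(
    {"sales": [float("nan"), 857483000, 1146245000, -16169620000, 501221000]},
    index=index,
)

# Approach 1: move the index levels into columns so that .str is available,
# filter, then rebuild the MultiIndex.
flat = df.reset_index()
via_reset = flat[flat["RPT_Date"].str.endswith("0630")].set_index(
    ["STK_ID", "RPT_Date"]
)

# Approach 2: iterate over the index as (STK_ID, RPT_Date) tuples and
# build a boolean mask from the second element of each tuple.
mask = [key[1].endswith("0630") for key in df.index]
via_tuples = df[mask]

print(via_tuples)
```

Both `via_reset` and `via_tuples` should contain the same three Q2 rows.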

  • You can also access an index level by name: `df[df.index.get_level_values('RPT_Date').str.endswith('0630')]` (2 upvotes)
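The suggestion in this comment can be sketched as a runnable snippet. The frame below is a hypothetical cut-down version of the one in the question, and it again assumes that the RPT_Date level contains strings; with an integer level, the `.str` accessor would not be available.

```python
import pandas as pd

# Hypothetical cut-down frame; values taken from the net_pft column in the question.
index = pd.MultiIndex.from_tuples(
    [
        ("000876", "20060630"),
        ("000876", "20060930"),
        ("002254", "20070331"),
        ("002254", "20070630"),
    ],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame(
    {"net_pft": [67157200, 50807000, -1538000, 118012200]}, index=index
)

# Pull out one level of the MultiIndex as a flat Index, selected by name.
rpt_date = df.index.get_level_values("RPT_Date")

# The .str accessor on an Index yields a boolean array usable as a row mask.
q2 = df[rpt_date.str.endswith("0630")]

print(q2)
```

This avoids both the reset_index/set_index round trip and the positional `x[1]` lookup, since the level is addressed by its name.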