blz*_*blz 4 python indexing vectorization pandas
Here is a sample DataFrame that I will use to better illustrate my question:
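import numpy as np
import pandas as pd
# pd.np has been removed from pandas, so NumPy is imported directly
df = pd.DataFrame(np.random.rand(30, 3), columns=tuple('ABC'))
# start with an all-NaN object column so that strings can be assigned to it
df['event'] = pd.Series(np.nan, index=df.index, dtype=object)
df.loc[10, 'event'] = 'ping'
df.loc[20, 'event'] = 'ping'
df.loc[19, 'event'] = 'pong'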
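I need to create windows of n rows centered on each occurrence of ping.
In other words, let i be the index of a row that contains ping in the event column. For every such i, I want to select df.ix[i-n:i+n] (df.ix has since been removed from pandas; df.loc[i-n:i+n] is the label-based equivalent).
So, for n=3, I would expect the following result: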
             A          B          C  event
7    0.8295863  0.2162861  0.4856461    NaN
8     0.156646  0.4730667  0.9968878    NaN
9    0.6709413  0.4796197  0.8747416    NaN
10  0.09942329   0.154008  0.5761598   ping
11   0.7168143   0.678207  0.7281105    NaN
12   0.8915475  0.8013187  0.9049722    NaN
13   0.9545411  0.4844835  0.1645746    NaN
17   0.9909208  0.1091025  0.6582635    NaN
18   0.2536326  0.4324749  0.8001643    NaN
19   0.4734659  0.5582809  0.1221296   pong
20   0.7230407  0.6695843  0.3902591   ping
21   0.3624909  0.2685049  0.5484445    NaN
22  0.05626284  0.6113877  0.9131929    NaN
23   0.8312294  0.5694373  0.4325798    NaN
[14 rows x 4 columns]
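A caveat:
pong is a value that we do not want to center a window on. It is nevertheless captured in the result, because it falls inside the window centered on the second ping. How can this be achieved?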
In [17]: n = 3
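Build an indexer for each range you need, e.g. the target index +/- 3 (clipped to the min/max of the frame size). Concatenate them all and eliminate the duplicates. Note that np.arange excludes its stop value, so as written each window covers i-n through i+n-1; use min(i+n+1, len(df)) as the stop to include row i+n, as in the expected result of the question.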
In [18]: indexers = np.unique(np.concatenate([ np.arange(max(i-n,0),min(i+n,len(df))) for i in df[df.event=='ping'].index ]))
In [19]: indexers
Out[19]: array([ 7, 8, 9, 10, 11, 12, 17, 18, 19, 20, 21, 22])
Select them.
In [20]: df.iloc[indexers]
Out[20]:
             A           B          C  event
7   0.03348742  0.05735324  0.1220022    NaN
8    0.9567363   0.6539097  0.8409577    NaN
9    0.3115902   0.4955503  0.1749197    NaN
10   0.6883777   0.6185107  0.7933182   ping
11   0.5185129   0.6533616  0.1569159    NaN
12   0.1196976   0.9638604  0.7318006    NaN
17  0.02897615   0.1224485  0.5706852    NaN
18  0.02409971   0.4715463  0.4587161    NaN
19   0.9070592   0.3371241  0.9543977   pong
20   0.8533369   0.7549413  0.5334882   ping
21   0.9546738   0.8203931  0.8543028    NaN
22  0.05691086   0.2402766  0.3922318    NaN
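The steps above can be sketched as a self-contained script. This is only a sketch under stated assumptions: the helper name window_positions is hypothetical, it swaps the list comprehension for NumPy broadcasting, and it extends each window to i+n inclusive so that the selection matches the 14 rows expected in the question.

```python
import numpy as np
import pandas as pd


def window_positions(mask, n):
    """Hypothetical helper: sorted, de-duplicated row positions lying
    within n rows of any True entry in the boolean array `mask`."""
    centers = np.flatnonzero(mask)            # positions of the matching rows
    offsets = np.arange(-n, n + 1)            # -n .. +n, both ends included
    candidates = (centers[:, None] + offsets).ravel()
    # drop positions that fall off either end of the frame
    in_bounds = (candidates >= 0) & (candidates < len(mask))
    return np.unique(candidates[in_bounds])


# Rebuild a frame shaped like the one in the question (values are random).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((30, 3)), columns=list("ABC"))
events = np.full(len(df), None, dtype=object)
events[[10, 20]] = "ping"
events[19] = "pong"
df["event"] = events

mask = (df["event"] == "ping").to_numpy()
positions = window_positions(mask, n=3)
result = df.iloc[positions]                   # purely positional selection
print(result)
```

Because overlapping windows are merged by np.unique, a row that falls inside two windows appears only once in the result.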
Note that you may need to do a df.reset_index() before selecting, so that you get actual row positions rather than index values.
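To illustrate why positions matter, here is a small sketch with a frame whose index is not 0..N-1. The frame, its labels, and the variable names are made up for illustration. It shows the reset_index() route suggested above next to np.flatnonzero, which is one possible way to obtain positions without resetting the index.

```python
import numpy as np
import pandas as pd

# Illustrative frame with string labels instead of the default integer index.
df = pd.DataFrame(
    {"event": [None, None, "ping", None, None, None]},
    index=list("uvwxyz"),
)
n = 1

# Route 1: reset the index so that labels coincide with row positions.
flat = df.reset_index()                       # old labels move to column "index"
centers_reset = flat[flat["event"] == "ping"].index.to_numpy()

# Route 2: take positions straight from the boolean mask.
centers_direct = np.flatnonzero((df["event"] == "ping").to_numpy())

# Either way the centers are positions, so iloc selects the right rows.
i = centers_direct[0]
window = df.iloc[np.arange(max(i - n, 0), min(i + n + 1, len(df)))]
print(window)
```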
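Note that there is a bug here: setting the 'event' column converts everything to object dtype, see here. You can mitigate this with df.convert_objects() (since removed from pandas; df.infer_objects() plays a similar role in current versions).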
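The dtype problem can be checked and undone along the following lines. This sketch assumes a pandas version that provides DataFrame.infer_objects(), and it forces the numeric columns to object explicitly, since whether the original bug reproduces depends on the pandas version in use.

```python
import numpy as np
import pandas as pd

# Simulate the symptom: numeric columns stored with object dtype.
df = pd.DataFrame(np.random.rand(5, 3), columns=list("ABC")).astype(object)
df["event"] = [None, "ping", None, None, "pong"]
print(df.dtypes)        # every column reports object

# infer_objects() attempts a soft conversion of object columns to better dtypes.
fixed = df.infer_objects()
print(fixed.dtypes)     # A, B, C become float64; event stays object
```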