gus*_*ago 6 python apply dataframe pandas
I have the following DataFrame C.
>>> C
a b c
2011-01-01 0 0 NaN
2011-01-02 41 12 NaN
2011-01-03 82 24 NaN
2011-01-04 123 36 NaN
2011-01-05 164 48 NaN
2011-01-06 205 60 2
2011-01-07 246 72 4
2011-01-08 287 84 6
2011-01-09 328 96 8
2011-01-10 369 108 10
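For anyone who wants to reproduce the frame, here is a minimal sketch that builds it directly. It assumes the index is a daily DatetimeIndex and that `a` and `b` are multiples of 41 and 12, which matches the values shown above.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame: 10 daily rows starting on 2011-01-01.
C = pd.DataFrame(
    {
        "a": np.arange(10) * 41,                   # 0, 41, ..., 369
        "b": np.arange(10) * 12,                   # 0, 12, ..., 108
        "c": [np.nan] * 5 + [2, 4, 6, 8, 10],      # first five rows are missing
    },
    index=pd.date_range("2011-01-01", periods=10),
)
print(C)
```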
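I want to add a new column d by applying a rolling function over a fixed window (6 here), where for each window the value of c is held fixed at the value from the window's last row (date). One iteration of this rolling function should look like this (pseudocode):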
a b c d
2011-01-01 0 0 NaN a + b*2 (a,b from this row, '2' is from 'c' on 2011-01-06)
2011-01-02 41 12 NaN a + b*2 (a,b from this row, '2' is still from 2011-01-06)
2011-01-03 82 24 NaN a + b*2
2011-01-04 123 36 NaN a + b*2
2011-01-05 164 48 NaN a + b*2
2011-01-06 205 60 2 a + b*2
2011-01-07 246 72 4
2011-01-08 287 84 6
2011-01-09 328 96 8
2011-01-10 369 108 10
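After this "loop", I want to take all 6 computed rows in d and pass them to a function call. That function returns a single value, which should be stored in another column e, for example: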
a b c d e
2011-01-01 0 0 NaN a + b*2 ---| NaN
2011-01-02 41 12 NaN a + b*2 | NaN
2011-01-03 82 24 NaN a + b*2 | These values NaN
2011-01-04 123 36 NaN a + b*2 | are input to NaN
2011-01-05 164 48 NaN a + b*2 | function NaN
2011-01-06 205 60 2 a + b*2 ---| yielding X
2011-01-07 246 72 4 value X in
2011-01-08 287 84 6 column 'e'
2011-01-09 328 96 8
2011-01-10 369 108 10
This procedure is then repeated for the next window (again 6 rows long), like so:
a b c d e
2011-01-01 0 0 NaN
2011-01-02 41 12 NaN a + b*4 (a,b from this row, '4' is from 'c' now from 2011-01-07)
2011-01-03 82 24 NaN a + b*4 (a,b from this row, '4' is still from 2011-01-07)
2011-01-04 123 36 NaN a + b*4
2011-01-05 164 48 NaN a + b*4
2011-01-06 205 60 2 a + b*4 X
2011-01-07 246 72 4 a + b*4
2011-01-08 287 84 6
2011-01-09 328 96 8
2011-01-10 369 108 10
a b c d e
2011-01-01 0 0 NaN NaN
2011-01-02 41 12 NaN a + b*4 ---| NaN
2011-01-03 82 24 NaN a + b*4 | These values NaN
2011-01-04 123 36 NaN a + b*4 | are input to NaN
2011-01-05 164 48 NaN a + b*4 | function NaN
2011-01-06 205 60 2 a + b*4 | yielding X
2011-01-07 246 72 4 a + b*4 ---| value Y in Y
2011-01-08 287 84 6 column 'e'
2011-01-09 328 96 8
2011-01-10 369 108 10
I hope this is clear.
Thanks, N
You can use pd.rolling_apply:
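import numpy as np
import pandas as pd

df = pd.read_table('data', sep=r'\s+')

def foo(x, df):
    # x holds the integer positions of the rows in the current window
    window = df.iloc[x]
    # print(window)
    # c is taken from the last row of the window and held fixed
    c = df.ix[int(x[-1]), 'c']
    dvals = window['a'] + window['b']*c
    return bar(dvals)

def bar(dvals):
    # print(dvals)
    return dvals.mean()

# roll over row positions rather than values, so foo can look up whole rows
df['e'] = pd.rolling_apply(np.arange(len(df)), 6, foo, args=(df,))
print(df)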
which yields
a b c e
2011-01-01 0 0 NaN NaN
2011-01-02 41 12 NaN NaN
2011-01-03 82 24 NaN NaN
2011-01-04 123 36 NaN NaN
2011-01-05 164 48 NaN NaN
2011-01-06 205 60 2 162.5
2011-01-07 246 72 4 311.5
2011-01-08 287 84 6 508.5
2011-01-09 328 96 8 753.5
2011-01-10 369 108 10 1046.5
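To see where these numbers come from, the same computation can be written as an explicit loop over the windows. This is only a sketch for checking the arithmetic: the frame is rebuilt inline rather than read from the 'data' file, bar is taken to be the mean as in the code above, and `window_size` and `e` are names chosen here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": np.arange(10) * 41,
        "b": np.arange(10) * 12,
        "c": [np.nan] * 5 + [2, 4, 6, 8, 10],
    },
    index=pd.date_range("2011-01-01", periods=10),
)

window_size = 6
e = pd.Series(np.nan, index=df.index)

# Each window ends at position `end` and covers the 6 rows up to and including it.
for end in range(window_size - 1, len(df)):
    window = df.iloc[end - window_size + 1 : end + 1]
    c = df["c"].iloc[end]                  # c is fixed at the window's last row
    d = window["a"] + window["b"] * c      # the intermediate column d for this window
    e.iloc[end] = d.mean()                 # stand-in for the function applied to d

print(e)
```

For the first full window c is 2, so d is 0, 65, 130, 195, 260, 325, and its mean is 162.5, matching the row for 2011-01-06.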
The args and kwargs parameters were added to rolling_apply in pandas version 0.14.0.
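In later pandas releases, pd.rolling_apply and the .ix indexer were removed. Below is a sketch of the same idea using Series.rolling(...).apply, assuming a recent pandas version. The frame is built inline instead of being read from the 'data' file, and `positions` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": np.arange(10) * 41,
        "b": np.arange(10) * 12,
        "c": [np.nan] * 5 + [2, 4, 6, 8, 10],
    },
    index=pd.date_range("2011-01-01", periods=10),
)

def bar(dvals):
    return dvals.mean()

def foo(x, df):
    # With raw=True, x arrives as a float ndarray of row positions.
    pos = x.astype(int)
    window = df.iloc[pos]
    c = df["c"].iloc[pos[-1]]              # c from the last row of the window
    dvals = window["a"] + window["b"] * c
    return bar(dvals)

# Roll over row positions, as in the original answer, so foo can select whole rows.
positions = pd.Series(np.arange(len(df)), index=df.index)
df["e"] = positions.rolling(6).apply(foo, raw=True, args=(df,))
print(df)
```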
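Since df is a global variable in my example above, it is not necessary to pass it to foo as an argument. You can simply remove df from the def foo line and also omit args=(df,) in the call to rolling_apply.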
However, sometimes df may not be defined in a scope that foo can access. In that case there is a simple workaround: make a closure:
def foo(df):
    def inner_foo(x):
        window = df.iloc[x]
        # print(window)
        c = df.ix[int(x[-1]), 'c']
        dvals = window['a'] + window['b']*c
        return bar(dvals)
    return inner_foo

df['e'] = pd.rolling_apply(np.arange(len(df)), 6, foo(df))
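The closure matters when the frame only exists as a local variable. The sketch below illustrates that case with a hypothetical helper, `add_e`, that receives the frame as a parameter. It assumes a recent pandas version, so it uses Series.rolling(...).apply and .iloc in place of pd.rolling_apply and .ix; `make_frame` is likewise a made-up stand-in for loading the data.

```python
import numpy as np
import pandas as pd

def bar(dvals):
    return dvals.mean()

def foo(df):
    # inner_foo captures df from the enclosing call, so no global is needed.
    def inner_foo(x):
        pos = x.astype(int)                # raw=True passes float positions
        window = df.iloc[pos]
        c = df["c"].iloc[pos[-1]]
        return bar(window["a"] + window["b"] * c)
    return inner_foo

def add_e(frame, size=6):
    # Hypothetical helper: frame is local here, not a module-level variable.
    positions = pd.Series(np.arange(len(frame)), index=frame.index)
    out = frame.copy()
    out["e"] = positions.rolling(size).apply(foo(frame), raw=True)
    return out

def make_frame():
    # Stand-in for reading the example data.
    return pd.DataFrame(
        {
            "a": np.arange(10) * 41,
            "b": np.arange(10) * 12,
            "c": [np.nan] * 5 + [2, 4, 6, 8, 10],
        },
        index=pd.date_range("2011-01-01", periods=10),
    )

result = add_e(make_frame())
print(result)
```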