The*_*r23 8 python performance numpy pandas
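I have a time series "Ser" and I want to compute its volatility (standard deviation) over a rolling window. My current code does this correctly in the following form: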
w = 10
length = len(Ser) - w + 1  # number of full windows
volList = []
for timestep in range(length):
    subSer = Ser[timestep:timestep + w]
    mean_i = np.mean(subSer)
    vol_i = (np.sum((subSer - mean_i)**2) / len(subSer))**0.5
    volList.append(vol_i)
This seems very inefficient to me. Does pandas have built-in functionality for this kind of computation?
Mad*_*ist 16
It looks like you are looking for Series.rolling. You can apply the std calculation to the resulting object:
roller = Ser.rolling(w)
volList = roller.std(ddof=0)
If you don't plan to use the rolling window object again, you can write a one-liner:
volList = Ser.rolling(w).std(ddof=0)
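Keep in mind that ddof=0 is necessary in this case because the standard deviation is normalized by the number of observations in each window minus ddof, and ddof defaults to 1 in pandas.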
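As a quick sanity check, the sketch below compares the rolling result with np.std (which uses ddof=0 by default) applied to the same windows. The series values and the window length are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Made-up example data; any numeric Series would do
Ser = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0])
w = 4

# Population standard deviation (ddof=0) over each full window
vol_pop = Ser.rolling(w).std(ddof=0)
# pandas default: sample standard deviation (ddof=1)
vol_sample = Ser.rolling(w).std()

# Reference values computed window by window, as in the question's loop
expected = [np.std(Ser.values[i:i + w]) for i in range(len(Ser) - w + 1)]

# The first w - 1 entries are NaN because those windows are incomplete
print(np.allclose(vol_pop.dropna().values, expected))
print(np.allclose(vol_sample.dropna().values, expected))
```

Only the ddof=0 result should agree with the loop. Note also that rolling aligns each value with the last element of its window, so the output has w - 1 leading NaN entries, whereas the loop produces a shorter list.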
mcg*_*uip 11
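"Volatility" is ambiguous even in a financial sense. The most commonly used type is realized volatility, which is the square root of realized variance. It differs from a plain standard deviation of returns in ways the code below makes explicit: log returns are used, the result is annualized, and in the variance-swap variant the returns are not demeaned.
There are several ways to compute realized volatility; I have implemented the two most common below: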
import numpy as np
window = 21 # trading days in rolling window
days_per_year = 252 # trading days per year
ann_factor = days_per_year / window
df['log_rtn'] = np.log(df['price']).diff()
# Var Swap (returns are not demeaned)
df['real_var'] = np.square(df['log_rtn']).rolling(window).sum() * ann_factor
df['real_vol'] = np.sqrt(df['real_var'])
# Classical (returns are demeaned, dof=1); var() already divides by window - 1,
# so annualize with days_per_year rather than ann_factor
df['real_var'] = df['log_rtn'].rolling(window).var() * days_per_year
df['real_vol'] = np.sqrt(df['real_var'])
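To try the two estimators side by side, here is a minimal self-contained sketch. The synthetic price path, the seed, and the variable names real_vol_swap and real_vol_classic are assumptions made for illustration. The classical estimate is annualized with days_per_year, on the reasoning that rolling var() already divides by window - 1, while the variance-swap sum still needs the 1/window factor carried by ann_factor.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices (illustrative only): geometric random walk
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": 100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 300)))})

window = 21          # trading days in rolling window
days_per_year = 252  # trading days per year
ann_factor = days_per_year / window

df["log_rtn"] = np.log(df["price"]).diff()

# Variance-swap style: sum of squared log returns, not demeaned
real_vol_swap = np.sqrt(
    np.square(df["log_rtn"]).rolling(window).sum() * ann_factor
)

# Classical: demeaned sample variance (ddof=1), annualized
real_vol_classic = np.sqrt(
    df["log_rtn"].rolling(window).var() * days_per_year
)

print(pd.DataFrame({"swap": real_vol_swap, "classic": real_vol_classic}).tail())
```

Keeping the two results in separate variables avoids the second calculation overwriting the first, which makes them easier to compare.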
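Typically, [finance-type] people quote volatility as an annualized figure based on percentage changes in price.
Assuming you have daily prices in a dataframe df and there are 252 trading days in a year, something like the following is probably what you want: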
df.pct_change().rolling(window_size).std()*(252**0.5)
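A runnable version of that one-liner might look like the following sketch. The synthetic prices, the column name "price", and the choice of window_size = 21 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices; the column name "price" is an assumption
rng = np.random.default_rng(1)
df = pd.DataFrame({"price": 100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 120)))})

window_size = 21  # assumed window length in trading days

# Daily percentage returns; the first entry is NaN
daily_rtn = df["price"].pct_change()

# Rolling sample std (ddof=1) of daily returns, scaled by sqrt(252) to annualize
ann_vol = daily_rtn.rolling(window_size).std() * 252 ** 0.5

print(ann_vol.dropna().tail())
```

The square root appears because variance scales linearly with the number of periods, so standard deviation scales with its square root.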
Here's a NumPy approach -
# From http://stackoverflow.com/a/14314054/3293881 by @Jaime
def moving_average(a, n=3):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

# From http://stackoverflow.com/a/40085052/3293881
def strided_app(a, L, S=1):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

def rolling_meansqdiff_numpy(a, w):
    A = strided_app(a, w)
    B = moving_average(a, w)
    subs = A - B[:, None]
    sums = np.einsum('ij,ij->i', subs, subs)
    return (sums / w)**0.5
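On NumPy 1.20 or later, the windowing step can also be done with numpy.lib.stride_tricks.sliding_window_view, which avoids computing strides by hand. This is a swapped-in alternative rather than the method above, and the function name rolling_std_windows is hypothetical; it has not been timed against the other approaches here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_std_windows(a, w):
    """Population std (ddof=0) over every full window of length w."""
    a = np.asarray(a, dtype=float)
    # Read-only view of shape (len(a) - w + 1, w); no data is copied here
    windows = sliding_window_view(a, w)
    return windows.std(axis=1)

# Made-up input for a quick check against the explicit loop
data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
result = rolling_std_windows(data, 4)
loop = [np.std(data[i:i + 4]) for i in range(len(data) - 4 + 1)]
print(np.allclose(result, loop))
```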
Sample run -
In [202]: Ser = pd.Series(np.random.randint(0,9,(20)))
In [203]: rolling_meansqdiff_loopy(Ser, w=10)
Out[203]:
[2.6095976701399777,
2.3000000000000003,
2.118962010041709,
2.022374841615669,
1.746424919657298,
1.7916472867168918,
1.3000000000000003,
1.7776388834631178,
1.6852299546352716,
1.6881943016134133,
1.7578395831246945]
In [204]: rolling_meansqdiff_numpy(Ser.values, w=10)
Out[204]:
array([ 2.60959767, 2.3 , 2.11896201, 2.02237484, 1.74642492,
1.79164729, 1.3 , 1.77763888, 1.68522995, 1.6881943 ,
1.75783958])
Runtime test
Loopy approach -
def rolling_meansqdiff_loopy(Ser, w):
    length = Ser.shape[0] - w + 1
    volList = []
    for timestep in range(length):
        subSer = Ser[timestep:timestep + w]
        mean_i = np.mean(subSer)
        vol_i = (np.sum((subSer - mean_i)**2) / len(subSer))**0.5
        volList.append(vol_i)
    return volList
Timings -
In [223]: Ser = pd.Series(np.random.randint(0,9,(10000)))
In [224]: %timeit rolling_meansqdiff_loopy(Ser, w=10)
1 loops, best of 3: 2.63 s per loop
# @Mad Physicist's vectorized soln
In [225]: %timeit Ser.rolling(10).std(ddof=0)
1000 loops, best of 3: 380 µs per loop
In [226]: %timeit rolling_meansqdiff_numpy(Ser.values, w=10)
1000 loops, best of 3: 393 µs per loop
That is close to a 7000x speedup for the two vectorized approaches over the loopy one!