Efficient Python Pandas Stock Beta Calculation on Many Dataframes

cws*_*wse 13 python algorithm performance dataframe pandas

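I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close), which I import into individual Pandas dataframes to perform analysis. I am new to Python and want to calculate a rolling 12-month beta for each stock. I found a post on calculating rolling beta (Python pandas calculate rolling stock beta using rolling apply to groupby object in vectorized fashion), but when I use it in my code below it takes over 2.5 hours! Considering I can run the exact same calculation on SQL tables in under 3 minutes, this is too slow.

How can I improve the performance of the code below to match that of SQL? I understand Pandas/Python has that capability. My current method loops over each row, which I know slows performance, but I am not aware of any aggregate way to perform a rolling-window beta calculation on a dataframe.

Note: the first two steps, loading the CSVs into individual dataframes and calculating daily returns, take only about 20 seconds. All my CSV dataframes are stored in a dictionary called "FilesLoaded", keyed by names such as "XAO".

Your help would be much appreciated! Thank you :)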

import pandas as pd, numpy as np
import datetime
import ntpath
pd.set_option('display.precision',10)  #Set the Decimal Point precision to DISPLAY
start_time=datetime.datetime.now()

MarketIndex = 'XAO'
period = 250
MinBetaPeriod = period
# ***********************************************************************************************
# CALC RETURNS 
# ***********************************************************************************************
for File in FilesLoaded:
    FilesLoaded[File]['Return'] = FilesLoaded[File]['Close'].pct_change()
# ***********************************************************************************************
# CALC BETA
# ***********************************************************************************************
def calc_beta(df):
    np_array = df.values
    m = np_array[:,0] # market returns are column zero from numpy array
    s = np_array[:,1] # stock returns are column one from numpy array
    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
    return beta

#Build Custom "Rolling_Apply" function
def rolling_apply(df, period, func, min_periods=None):
    if min_periods is None:
        min_periods = period
    result = pd.Series(np.nan, index=df.index)
    for i in range(1, len(df)+1):
        sub_df = df.iloc[max(i-period, 0):i,:]
        if len(sub_df) >= min_periods:  
            idx = sub_df.index[-1]
            result[idx] = func(sub_df)
    return result

#Create empty BETA dataframe with same index as RETURNS dataframe
df_join = pd.DataFrame(index=FilesLoaded[MarketIndex].index)    
df_join['market'] = FilesLoaded[MarketIndex]['Return']
df_join['stock'] = np.nan

for File in FilesLoaded:
    df_join['stock'].update(FilesLoaded[File]['Return'])
    df_join  = df_join.replace(np.inf, np.nan) #get rid of infinite values "inf" (SQL won't take "Inf")
    df_join  = df_join.replace(-np.inf, np.nan)#get rid of infinite values "inf" (SQL won't take "Inf")
    df_join  = df_join.fillna(0) #get rid of the NaNs in the return data
    FilesLoaded[File]['Beta'] = rolling_apply(df_join[['market','stock']], period, calc_beta, min_periods = MinBetaPeriod)

# ***********************************************************************************************
# CLEAN-UP
# ***********************************************************************************************
print('Run-time: {0}'.format(datetime.datetime.now() - start_time))

piR*_*red 11


Generate random stock data: 480 months (40 years) of monthly data for 4,000 stocks

dates = pd.date_range('1995-12-31', periods=480, freq='M', name='Date')
stoks = pd.Index(['s{:04d}'.format(i) for i in range(4000)])
df = pd.DataFrame(np.random.rand(480, 4000), dates, stoks)
df.iloc[:5, :5]


Roll function
Returns a groupby object ready to apply custom functions
See Source

def roll(df, w):
    # stack df.values w-times shifted once at each stack
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    # roll_array is now a 3-D array and can be read into
    # a pandas Panel object (pd.Panel was removed in pandas 0.25,
    # so this function needs an older pandas release)
    panel = pd.Panel(roll_array, 
                     items=df.index[w-1:],
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    # convert to dataframe and pivot + groupby
    # is now ready for any action normally performed
    # on a groupby object
    return panel.to_frame().unstack().T.groupby(level=0)

Beta function
Uses the closed-form solution of OLS regression
Assumes column 0 is the market
See Source

def beta(df):
    # first column is the market
    X = df.values[:, [0]]
    # prepend a column of ones for the intercept
    X = np.concatenate([np.ones_like(X), X], axis=1)
    # matrix algebra
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values[:, 1:])
    return pd.Series(b[1], df.columns[1:], name='Beta')
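The slope from the closed-form OLS solution with an intercept is the same number as the covariance ratio used in the question's calc_beta. The following minimal sketch checks that on synthetic data; the variable names and the random data are illustrative assumptions, not part of the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(size=12)                        # hypothetical market returns
s = 0.8 * m + rng.normal(scale=0.1, size=12)   # hypothetical stock returns

# Closed-form OLS with an intercept: b = pinv(X'X) X'y
X = np.column_stack([np.ones_like(m), m])
b = np.linalg.pinv(X.T @ X) @ X.T @ s
ols_slope = b[1]                               # b[0] is the intercept

# Covariance ratio, as computed in calc_beta
cov = np.cov(s, m)
cov_ratio = cov[0, 1] / cov[1, 1]

print(ols_slope, cov_ratio)
```

The two values agree up to floating-point error, which is why the matrix form can replace a per-stock np.cov call and handle all stock columns in one product.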

Demonstration

rdf = roll(df, 12)
betas = rdf.apply(beta)

Timing

[image: timing output]


Validation
Compare calculations with the OP

def calc_beta(df):
    np_array = df.values
    m = np_array[:,0] # market returns are column zero from numpy array
    s = np_array[:,1] # stock returns are column one from numpy array
    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
    return beta
print(calc_beta(df.iloc[:12, :2]))

-0.311757542437
print(beta(df.iloc[:12, :2]))

s0001   -0.311758
Name: Beta, dtype: float64

Note the first cell
It has the same value as the validated calculations above

betas = rdf.apply(beta)
betas.iloc[:5, :5]


Response to comment
Complete working example simulating multiple dataframes

num_sec_dfs = 4000

cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(480, 4), dates, cols) for i in range(num_sec_dfs)}

market = pd.Series(np.random.rand(480), dates, name='Market')

df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(axis=1)

betas = roll(df.pct_change().dropna(), 12).apply(beta)

for c, col in betas.items():
    dfs[c]['Beta'] = col

dfs['s0001'].head(20)

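  • This looks perfect for what I need to do. Since Panel has been deprecated, I was wondering whether you could post an updated solution using DataFrames with a MultiIndex. Thank you :) (2 upvotes)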
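One Panel-free way to get the same rolling betas is pandas' built-in rolling covariance and variance. This is a minimal sketch of a swapped-in technique, not the MultiIndex reshaping the comment asks for and not the answer author's method. The function name rolling_beta, the column name 'Market', and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def rolling_beta(returns, market_col, window):
    """Rolling beta of every column against `market_col`, without pd.Panel."""
    market = returns[market_col]
    stocks = returns.drop(columns=market_col)
    # Rolling covariance of each stock column with the market series
    cov = stocks.rolling(window).cov(market)
    # Rolling variance of the market (same default ddof=1 as cov)
    var = market.rolling(window).var()
    # Divide each stock column row-wise by the market variance
    return cov.div(var, axis=0)

# Small synthetic example (data and column names are made up)
rng = np.random.default_rng(0)
idx = pd.date_range('2000-01-01', periods=24, freq='D', name='Date')
returns = pd.DataFrame(rng.normal(size=(24, 4)), index=idx,
                       columns=['Market', 's0000', 's0001', 's0002'])

betas = rolling_beta(returns, 'Market', 12)
print(betas.tail())
```

The first window - 1 rows are NaN because a full window is required by default. Each later row holds the beta of the window ending on that date, which matches the labelling used by the roll/beta pair above.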

piR*_*red 7

Using a generator for improved memory efficiency

Simulated data

m, n = 480, 10000
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
stocks = pd.Index(['s{:04d}'.format(i) for i in range(n)])
df = pd.DataFrame(np.random.rand(m, n), dates, stocks)
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([df, market], axis=1)

Beta calculation

def beta(df, market=None):
    # If the market values are not passed,
    # I'll assume they are located in a column
    # named 'Market'.  If not, this will fail.
    if market is None:
        market = df['Market']
        df = df.drop('Market', axis=1)
    X = market.values.reshape(-1, 1)
    X = np.concatenate([np.ones_like(X), X], axis=1)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
    return pd.Series(b[1], df.columns, name=df.index[-1])
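The beta function above accepts the market either as a separate argument or as a 'Market' column inside the frame. The sketch below exercises both calling conventions on a small synthetic frame. The function is repeated so the snippet runs on its own, and the data and stock names are made up for illustration.

```python
import numpy as np
import pandas as pd

def beta(df, market=None):
    # Same function as above, repeated so this snippet is self-contained
    if market is None:
        market = df['Market']
        df = df.drop('Market', axis=1)
    X = market.values.reshape(-1, 1)
    X = np.concatenate([np.ones_like(X), X], axis=1)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
    return pd.Series(b[1], df.columns, name=df.index[-1])

rng = np.random.default_rng(1)
idx = pd.date_range('2000-01-01', periods=12, freq='D', name='Date')
stocks = pd.DataFrame(rng.random((12, 3)), index=idx,
                      columns=['s0000', 's0001', 's0002'])
market = pd.Series(rng.random(12), index=idx, name='Market')

# 1) market passed explicitly
b1 = beta(stocks, market=market)
# 2) market stored as a 'Market' column of the frame
b2 = beta(pd.concat([stocks, market], axis=1))

print(b1)
```

Both calls return one beta per stock, and the resulting Series is named after the last date of the window, which is what lets the per-window results be concatenated into a date-indexed frame later.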

roll function
This returns a generator and is far more memory efficient

def roll(df, w):
    for i in range(df.shape[0] - w + 1):
        yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)
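To make the windowing concrete, here is a small sketch that runs the generator on a toy five-row frame. The frame toy and its columns are made up for illustration, and the generator is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def roll(df, w):
    # Same generator as above, repeated so this snippet is self-contained
    for i in range(df.shape[0] - w + 1):
        yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)

toy = pd.DataFrame({'a': np.arange(5.0), 'b': np.arange(5.0) * 10},
                   index=pd.date_range('2000-01-01', periods=5, freq='D'))

# Materialised as a list only for inspection; in real use, iterate lazily
windows = list(roll(toy, 3))

print(len(windows))   # 5 - 3 + 1 = 3 windows
print(windows[0])     # first three rows of toy
```

Because each window is built only when the consumer asks for it, at most one w-row frame needs to exist at a time, instead of the full 3-D stack that the Panel-based version constructs up front.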

Putting it together

betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T

Validation

OP beta calculation

def calc_beta(df):
    np_array = df.values
    m = np_array[:,0] # market returns are column zero from numpy array
    s = np_array[:,1] # stock returns are column one from numpy array
    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
    return beta

Experiment setup

m, n = 12, 2
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')

cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(m, 4), dates, cols) for i in range(n)}

market = pd.Series(np.random.rand(m), dates, name='Market')

df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(axis=1)

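# With m = 12, pct_change().dropna() leaves 11 rows and no 12-row window,
# so roll over the levels directly, matching the calc_beta check on df
betas = pd.concat([beta(sdf) for sdf in roll(df, 12)], axis=1).T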

for c, col in betas.items():
    dfs[c]['Beta'] = col

dfs['s0000'].head(20)

calc_beta(df[['Market', 's0000']])

0.0020118230147777435

Note:
The calculations are the same