Why does pandas rolling use a single-dimension ndarray?

piR*_*red 16 python group-by numpy pandas pandas-groupby

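I was motivated to use the pandas rolling feature to perform a rolling multi-factor regression (this question is NOT about rolling multi-factor regression). I expected that I'd be able to use apply after a df.rolling(2), take the resulting pd.DataFrame, extract the ndarray with .values, and perform the requisite matrix multiplication. It didn't work out that way.

Here is what I found: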

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
X = np.random.rand(2, 1).round(2)

What the objects look like:

print "\ndf = \n", df
print "\nX = \n", X
print "\ndf.shape =", df.shape, ", X.shape =", X.shape

df = 
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

X = 
[[ 0.93]
 [ 0.83]]

df.shape = (5, 2) , X.shape = (2L, 1L)

Matrix multiplication behaves as normal:

df.values.dot(X)

array([[ 0.7495],
       [ 0.8179],
       [ 0.4444],
       [ 1.4711],
       [ 1.3562]])

Using apply to perform a row-by-row dot product behaves as expected:

df.apply(lambda x: x.values.dot(X)[0], axis=1)

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64

Groupby -> apply behaves as I'd expect:

df.groupby(level=0).apply(lambda x: x.values.dot(X)[0, 0])

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64

But when I run:

df.rolling(1).apply(lambda x: x.values.dot(X))

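I get:

AttributeError: 'numpy.ndarray' object has no attribute 'values'

OK, so pandas is using a plain ndarray within its rolling implementation. I can handle that. Instead of using .values to get the ndarray, let's try: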

df.rolling(1).apply(lambda x: x.dot(X))

shapes (1,) and (2,1) not aligned: 1 (dim 0) != 2 (dim 0)

Wait! What?!

So I created a custom function to look at what rolling is doing.

def print_type_sum(x):
    print type(x), x.shape
    return x.sum()

Then ran:

print df.rolling(1).apply(print_type_sum)

<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

My resulting pd.DataFrame is the same, which is good. But it printed out 10 single-dimensional ndarray objects. What about rolling(2)?

print df.rolling(2).apply(print_type_sum)

<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
      A     B
0   NaN   NaN
1  0.90  0.88
2  0.92  0.49
3  1.31  0.84
4  1.63  1.58

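Same thing: the output is as expected, but it printed 8 ndarray objects. rolling is producing a single-dimensional ndarray of length window for each column, as opposed to what I expected, which was an ndarray of shape (window, len(df.columns)).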
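
The same check can be reproduced on a recent pandas. This is a minimal sketch, assuming a pandas version where rolling().apply accepts raw=True (recent versions default to raw=False and pass a Series rather than an ndarray). record_shape is a hypothetical helper, not part of pandas, and the frame is typed in by hand from the values printed above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.44, 0.46, 0.46, 0.85, 0.78],
                   'B': [0.41, 0.47, 0.02, 0.82, 0.76]})

seen = []  # (type name, shape) of every window handed to the function

def record_shape(x):
    # hypothetical helper: log what rolling passes in, then reduce to a scalar
    seen.append((type(x).__name__, x.shape))
    return x.sum()

# raw=True asks pandas to pass plain ndarrays to the function
result = df.rolling(2).apply(record_shape, raw=True)

# Each call receives one column's window as a 1-D array,
# never a (window, n_columns) block.
print(set(seen))
print(result)
```

Because the function only ever sees one column at a time, anything that needs all columns of a window together (such as a dot product with X) cannot be expressed through rolling().apply alone.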

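The question is: why?

I now don't have a way to easily run a rolling multi-factor regression.

I wanted to share what I've done to work around this problem.

Given a pd.DataFrame and a window, I generate a stacked ndarray using np.dstack (see answer). I then convert it to a pd.Panel and, using pd.Panel.to_frame, convert it to a pd.DataFrame. At this point, I have a pd.DataFrame with an additional level on its index relative to the original pd.DataFrame, and the new level contains information about each rolled period. For example, if the roll window is 3, the new index level will contain [0, 1, 2], one item for each period. I can now groupby level=0 and return the groupby object. This gives me an object that I can manipulate much more intuitively.

Roll Function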

import pandas as pd
import numpy as np

def roll(df, w):
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    panel = pd.Panel(roll_array, 
                     items=df.index[w-1:],
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    return panel.to_frame().unstack().T.groupby(level=0)
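
pd.Panel is no longer available in recent pandas releases, so the function above will not run there. The following is a rough sketch of the same idea without Panel, assuming pd.concat over a dict of per-window sub-frames is an acceptable substitute. roll_frames is a hypothetical name, and the level names 'major' and 'roll' are chosen only to mirror the output shown in this answer.

```python
import numpy as np
import pandas as pd

def roll_frames(df, w):
    """Hypothetical Panel-free variant of roll(df, w)."""
    # one sub-frame per window, keyed by the label of the window's last row
    windows = {
        df.index[i + w - 1]: df.iloc[i:i + w].reset_index(drop=True)
        for i in range(len(df) - w + 1)
    }
    # outer level 'major' = window label, inner level 'roll' = position in window
    stacked = pd.concat(windows, names=['major', 'roll'])
    return stacked.groupby(level='major')

df = pd.DataFrame({'A': [0.44, 0.46, 0.46, 0.85, 0.78],
                   'B': [0.41, 0.47, 0.02, 0.82, 0.76]})
rolled = roll_frames(df, 2)
print(rolled.sum())

X = np.array([2, 3])
for label, window in rolled:
    # each window is a (w, n_columns) DataFrame, so a dot product with X works
    print(label, window.values.dot(X))
```

This builds one small DataFrame per window, so it trades speed for readability; it is meant for exploring the structure rather than for large frames.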

Demonstration

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

print df

      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

Let's sum

rolled_df = roll(df, 2)

print rolled_df.sum()

major     A     B
1      0.90  0.88
2      0.92  0.49
3      1.31  0.84
4      1.63  1.58

To peek under the hood, we can look at the structure:

print rolled_df.apply(lambda x: x)

major      A     B
  roll            
1 0     0.44  0.41
  1     0.46  0.47
2 0     0.46  0.47
  1     0.46  0.02
3 0     0.46  0.02
  1     0.85  0.82
4 0     0.85  0.82
  1     0.78  0.76

But what about the purpose I built this for, rolling multi-factor regression? For now, I'll settle for matrix multiplication.

X = np.array([2, 3])

print rolled_df.apply(lambda df: pd.Series(df.values.dot(X))) 

      0     1
1  2.11  2.33
2  2.33  0.98
3  0.98  4.16
4  4.16  3.84
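
Since the stated goal is a rolling multi-factor regression, here is a rough sketch of how per-window blocks could feed one. It assumes ordinary least squares without an intercept, solved with np.linalg.lstsq. rolling_ols and its arguments are hypothetical names, and the data is synthetic and noise-free so the fitted coefficients are easy to check.

```python
import numpy as np
import pandas as pd

def rolling_ols(factors, target, w):
    """Hypothetical helper: least-squares coefficients for each window of w rows."""
    rows = {}
    for i in range(len(factors) - w + 1):
        F = factors.iloc[i:i + w].to_numpy()   # (w, n_factors) block
        y = target.iloc[i:i + w].to_numpy()    # (w,) block
        coef, *_ = np.linalg.lstsq(F, y, rcond=None)
        # label each fit by the last row of its window
        rows[factors.index[i + w - 1]] = coef
    return pd.DataFrame.from_dict(rows, orient='index', columns=factors.columns)

rng = np.random.default_rng(0)
factors = pd.DataFrame(rng.random((6, 2)), columns=['A', 'B'])
target = factors['A'] * 2 + factors['B'] * 3   # synthetic target, no noise

betas = rolling_ols(factors, target, 3)
print(betas)
```

With a noise-free target built from the factors, every window should recover the same coefficients; real data would of course give a different estimate per window.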

Using the strided views concept on the dataframe, here's a vectorized approach -

get_sliding_window(df, 2).dot(X) # window size = 2
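
get_sliding_window is not defined anywhere in this text. As a stand-in, here is a minimal sketch built on numpy.lib.stride_tricks.sliding_window_view, assuming NumPy 1.20 or later. sliding_windows is a hypothetical name, and it is only assumed (not confirmed by the source) that the original helper returns windows in the same (n_windows, window, n_columns) layout.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def sliding_windows(df, w):
    """Hypothetical stand-in: strided view of shape (n_windows, w, n_columns)."""
    a = df.to_numpy()
    # sliding_window_view appends the window axis last: (n_windows, n_columns, w)
    view = sliding_window_view(a, window_shape=w, axis=0)
    # swap the last two axes so each window reads as a (w, n_columns) block
    return view.transpose(0, 2, 1)

df = pd.DataFrame({'A': [0.44, 0.46, 0.46, 0.85, 0.78],
                   'B': [0.41, 0.47, 0.02, 0.82, 0.76]})
X = np.array([2, 3])

# one dot product per row of every window, with no Python-level loop
out = sliding_windows(df, 2).dot(X)
print(out)
```

The view shares memory with the underlying array, so no window is copied until the dot product is evaluated; that is where the speed advantage over a groupby-based approach would come from.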

Runtime test -

In [101]: df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

In [102]: X = np.array([2, 3])

In [103]: rolled_df = roll(df, 2)

In [104]: %timeit rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
100 loops, best of 3: 5.51 ms per loop

In [105]: %timeit get_sliding_window(df, 2).dot(X)
10000 loops, best of 3: 43.7 µs per loop

Verify results -

In [106]: rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
Out[106]: 
      0     1
1  2.70  4.09
2  4.09  2.52
3  2.52  1.78
4  1.78  3.50

In [107]: get_sliding_window(df, 2).dot(X)
Out[107]: 
array([[ 2.7 ,  4.09],
       [ 4.09,  2.52],
       [ 2.52,  1.78],
       [ 1.78,  3.5 ]])

A huge improvement there, which I hope stays noticeable on larger arrays!

Archived: 10 years ago
Views: 4389
Last activity: 7 years, 9 months ago
