sch*_*dge 6 python loops regression numpy scipy
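Note that this is not a question about multiple regression; it is a question about performing simple (single-variable) regression many times in Python/NumPy (2.7).
I have two m x n arrays, x and y. The rows correspond to each other, and each pair of rows is the set of (x, y) points for one measurement. That is, plt.plot(x.T, y.T, '.') would plot each of the m datasets/measurements.
I'd like to know the best way to perform the m linear regressions. Currently I loop over the rows and call scipy.stats.linregress(). (Assume I don't want a solution based on doing linear algebra with the matrices, but would rather use this function or an equivalent black-box function.) I could try np.vectorize, but the docs indicate that it also loops.
With some experimentation, I also found a way to use list comprehensions with map() and get correct results. I've put both solutions below. In IPython, `%%timeit` on a small dataset (commented out) returns: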
(loop) 1000 loops, best of 3: 642 µs per loop
(map) 1000 loops, best of 3: 634 µs per loop
To try to scale this up, I made a much larger random dataset (dimensions trials x trials):
(loop, trials = 1000) 1 loops, best of 3: 299 ms per loop
(loop, trials = 10000) 1 loops, best of 3: 5.64 s per loop
(map, trials = 1000) 1 loops, best of 3: 256 ms per loop
(map, trials = 10000) 1 loops, best of 3: 2.37 s per loop
That is a decent speedup on a very large set, but I was expecting more. Is there a better way?
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

np.random.seed(42)
#y = np.array(((0,1,2,3),(1,2,3,4),(2,4,6,8)))
#x = np.tile(np.arange(4), (3,1))
trials = 1000
y = np.random.rand(trials,trials)
x = np.tile(np.arange(trials), (trials,1))
num_rows = np.shape(y)[0]
slope = np.zeros(num_rows)
inter = np.zeros(num_rows)

# Method 1: explicit loop, one linregress call per row
for k, xrow in enumerate(x):
    yrow = y[k,:]
    slope[k], inter[k], t1, t2, t3 = stats.linregress(xrow, yrow)
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope + inter)

# Can the loop be removed?
# Method 2: map() over lists of rows
tempx = [x[k,:] for k in range(num_rows)]
tempy = [y[k,:] for k in range(num_rows)]
# list() keeps this working on Python 3, where map() returns an iterator
results = np.array(list(map(stats.linregress, tempx, tempy)))
slope_vec = results[:,0]
inter_vec = results[:,1]
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope_vec + inter_vec)

print("Slopes equal by both methods?: %s" % np.allclose(slope, slope_vec))
print("Inters equal by both methods?: %s" % np.allclose(inter, inter_vec))
Single-variable linear regression is simple enough to vectorize by hand:
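def multiple_linregress(x, y):
    # Center each row; keepdims=True keeps the means broadcastable against x and y
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean

    # Row-wise sums of products: sum(dx*dy) / sum(dx*dx) for every row
    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]

    # One row per regression: column 0 is the slope, column 1 the intercept
    return np.column_stack((slope, intercept))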
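For reference, what the function computes for each row i is the usual closed-form least-squares estimate. The symbols below are notation introduced here for illustration (they do not appear in the original post): x̄_i and ȳ_i are the row means, β_i the slope and α_i the intercept.

```latex
\beta_i = \frac{\sum_{j=1}^{n} (x_{ij} - \bar{x}_i)(y_{ij} - \bar{y}_i)}
               {\sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2},
\qquad
\alpha_i = \bar{y}_i - \beta_i \, \bar{x}_i
```

The two `np.einsum('ij,ij->i', ...)` calls evaluate the numerator and denominator sums over j for all rows at once.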
Some made-up data:
m = 1000
n = 1000
x = np.random.rand(m, n)
y = np.random.rand(m, n)
It far outperforms your loop options:
%timeit multiple_linregress(x, y)
100 loops, best of 3: 14.1 ms per loop
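As a sanity check, the vectorized result can be compared against a row-by-row call to scipy.stats.linregress. The following is only a sketch: the small array sizes, the seed, and the names `x_small`, `y_small`, `vectorized`, `looped` and `results_match` are arbitrary choices made for illustration, and the function is repeated so the snippet runs on its own.

```python
import numpy as np
from scipy import stats


def multiple_linregress(x, y):
    # Same vectorized row-wise regression as above, repeated for self-containment
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean
    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]
    return np.column_stack((slope, intercept))


# Small illustrative dataset: 5 measurements of 20 points each
rng = np.random.RandomState(0)
x_small = rng.rand(5, 20)
y_small = rng.rand(5, 20)

vectorized = multiple_linregress(x_small, y_small)

# Reference: one linregress call per row, keeping only slope and intercept
looped = np.array([
    (res.slope, res.intercept)
    for res in (stats.linregress(xr, yr) for xr, yr in zip(x_small, y_small))
])

results_match = np.allclose(vectorized, looped)
print(results_match)
```

Note that the vectorized function returns only slopes and intercepts; linregress also reports the correlation coefficient, p-value and standard error, which would need extra work if required.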