Python linear regression model (Pandas, statsmodels) - ValueError: endog and exog matrices are different sizes

Question

usa*_*g1r 2 python linear-regression pandas statsmodels

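A friend of mine asked me about this linear regression code, and I couldn't solve it either, so now it is my problem too.

The error we get: ValueError: endog and exog matrices are different sizes

When I remove "Tech" from ind_names, it works fine. That may be meaningless, but I tried it to rule out a syntax error.

The Tech and Financial industry labels are not evenly distributed in the DataFrame, so maybe that causes the size mismatch? I couldn't debug any further, so I decided to ask here.

It would be great to get some confirmation of what causes the error and how to fix it. Please find the code below.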

#We have a portfolio constructed of 3 randomly generated factors (fac1, fac2, fac3).
#Python code provides the following message 
#ValueError: The indices for endog and exog are not aligned

import pandas as pd
from numpy.random import rand
import numpy as np
import statsmodels.api as sm

fac1, fac2, fac3 = np.random.rand(3, 1000) #Generate  random factors

#Consider a collection of hypothetical stock portfolios
#Generate randomly 1000 tickers
import random; random.seed(0)
import string
N = 1000
def rands(n):
  choices = string.ascii_uppercase
  return ''.join([random.choice(choices) for _ in range(n)])


tickers = np.array([rands(5) for _ in range(N)])
ticker_subset = tickers.take(np.random.permutation(N)[:1000])

#Weighted sum of factors plus noise

port = pd.Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 + rand(1000), index=ticker_subset)
factors = pd.DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3}, index=ticker_subset)

#Correlations between each factor and the portfolio 
#print(factors.corrwith(port))
factors1=sm.add_constant(factors)


#Calculate factor exposures using a regression estimated by OLS
#print(sm.OLS(np.asarray(port), np.asarray(factors1)).fit().params)

#Calculate the exposure on each industry
def beta_exposure(chunk, factors=None):
    return sm.OLS(np.asarray(chunk), np.asarray(factors)).fit().params


#Assume that we have only two industries – financial and tech

ind_names = np.array(['Financial', 'Tech'])
#Create a random industry classification 

sampler = np.random.randint(0, len(ind_names), N)
industries = pd.Series(ind_names[sampler], index=tickers, name='industry')
by_ind = port.groupby(industries)



exposures=by_ind.apply(beta_exposure, factors=factors1)
print(exposures)
#exposures.unstack()

#Determinate the exposures on each industry


Answer 1

Zev*_*Zev 6

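Understanding the error message:

ValueError: endog and exog matrices are different sizes

OK, not too bad. The endogenous and exogenous matrices are different sizes. The module provides this page, which tells us that endogenous variables are the factors within the system and exogenous variables are the factors outside it.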

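Some debugging

Check what shapes we are getting for our arrays. To do that, we need to pull apart the one-liner and print the .shape attribute of each argument, or print the first handful of rows of each. Also, comment out the line that throws the error. Doing that, we find that we get: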

chunk [490]
factor [1000    4]
chunk [510]
factor [1000    4]


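Oh! There it is. We were expecting the factors to be chunked too. The shape should be [490 4] the first time and [510 4] the second time. Note: since the categories are randomly assigned, the exact counts will differ on each run.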
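The debugging step above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the asker's exact script: the ticker labels, the `debug_exposure` helper, and the `shapes` list are all made up for the example, and the regression itself is skipped so that only the shapes are recorded.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 1000
index = [f"T{i:04d}" for i in range(N)]  # hypothetical unique tickers

port = pd.Series(rng.random(N), index=index)
factors = pd.DataFrame(rng.random((N, 4)), index=index,
                       columns=["const", "f1", "f2", "f3"])
industries = pd.Series(rng.choice(["Financial", "Tech"], N), index=index)

shapes = []  # collected (chunk shape, factor shape) pairs

def debug_exposure(chunk, factors=None):
    # Record what sm.OLS would receive instead of fitting anything.
    shapes.append((np.asarray(chunk).shape, np.asarray(factors).shape))
    return len(chunk)

port.groupby(industries).apply(debug_exposure, factors=factors)
for chunk_shape, factor_shape in shapes:
    print("chunk", chunk_shape, "factor", factor_shape)
```

Each chunk holds only one industry's rows, while the factor array keeps all 1000 rows on every call, which is the mismatch described above.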

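So basically we have too much information in that function. We can use chunk to see which factors to select, filter the factors down to just those rows, and then everything will work.

Looking at the function definition in the docs: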

class statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)


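We are only passing two arguments; the rest are optional. Let's look at the two we pass.

endog (array-like) – 1-d endogenous response variable. The dependent variable.

exog (array-like) – A nobs x k array where nobs is the number of observations and k is the number of regressors...

Ah, endog and exog again. endog is a 1-d array. So far so good, shape 490 works. exog, nobs? Oh, that is the number of observations. So exog is a 2-d array, and in this case we need shape 490 by 4.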

This specific issue:

beta_exposure should be:

def beta_exposure(chunk, factors=None):
    # Keep only the factor rows whose index labels appear in this chunk
    factors = factors.loc[factors.index.isin(chunk.index)]
    return sm.OLS(np.asarray(chunk), np.asarray(factors)).fit().params


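The problem is that you apply beta_exposure to each part of the list (the split is random, so say 490 elements for Financial and 510 for Tech), but factors=factors1 always gives you all 1000 rows (the groupby code never touches it).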

See http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html and http://www.statsmodels.org/dev/endog_exog.html for the references I used to research this.

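Thanks for such a descriptive answer! So I guess this confirms that the unequal number of industry labels caused the ValueError. It would be great to know how to write this code better and eliminate the error. (2 upvotes)
Happy to help. I think you would still get the same error if the split were 500 by 500. The uneven split is not the problem; the problem is that you send about 500 values but 1000 factor rows (you send one group at a time, but all of the factors every time). That said, if you do want a 500 by 500 split, I tried first doing "sampler = [0] * 500 + [1] * 500" and then "random.shuffle(sampler)". If my answer helped, please accept it. Thanks! By the way, I was this descriptive in case other people run into the same error message. You may already have known some of what I said, but it may be new to others. (2 upvotes)
I wanted to verify my answer, so I just added the line you were missing. (2 upvotes)
Of course, thanks again! This is an excellent answer. Thanks also for the 500/500 clarification. I don't think endog and exog are explained better anywhere else on the web, not even on the official site. (2 upvotes)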
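
The balanced 500/500 assignment mentioned in the comments can be sketched as below. It only changes how the industry labels are drawn; on its own it would not fix the size mismatch, since each group would still be regressed against all 1000 factor rows.

```python
import random
import numpy as np

random.seed(0)
ind_names = np.array(["Financial", "Tech"])

# Exactly 500 zeros and 500 ones, shuffled into a random order.
sampler = [0] * 500 + [1] * 500
random.shuffle(sampler)

labels = ind_names[sampler]
print((labels == "Financial").sum(), (labels == "Tech").sum())
```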

Archived:	7 years, 6 months ago
Views:	5310
Last recorded:	7 years, 6 months ago