Log*_*ang 5 python numpy statsmodels
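I am running a logistic regression on the Lalonde dataset to estimate propensity scores. I used the logit function from statsmodels.formula.api and wrapped the covariates in C() to make them categorical. Treating age and educ as continuous converges successfully, but making them categorical raises the following error: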
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.617306
Iterations: 35
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
<ipython-input-29-bae905b632a4> in <module>
----> 1 psmodel = fsms.logit('treatment ~ 1 + C(age) + C(educ) + C(black) + C(hisp) + C(married) + C(nodegr)', tdf).fit()
2 tdf['ps'] = psmodel.predict()
3 tdf.head()
~/venv/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
1832 bnryfit = super(Logit, self).fit(start_params=start_params,
1833 method=method, maxiter=maxiter, full_output=full_output,
-> 1834 disp=disp, callback=callback, **kwargs)
1835
1836 discretefit = LogitResults(self, bnryfit)
~/venv/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
218 mlefit = super(DiscreteModel, self).fit(start_params=start_params,
219 method=method, maxiter=maxiter, full_output=full_output,
--> 220 disp=disp, callback=callback, **kwargs)
221
222 return mlefit # up to subclasses to wrap results
~/venv/lib/python3.7/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
471 Hinv = cov_params_func(self, xopt, retvals)
472 elif method == 'newton' and full_output:
--> 473 Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
474 elif not skip_hessian:
475 H = -1 * self.hessian(xopt)
~/venv/lib/python3.7/site-packages/numpy/linalg/linalg.py in inv(a)
549 signature = 'D->D' if isComplexType(t) else 'd->d'
550 extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 551 ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
552 return wrap(ainv.astype(result_t, copy=False))
553
~/venv/lib/python3.7/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
95
96 def _raise_linalgerror_singular(err, flag):
---> 97 raise LinAlgError("Singular matrix")
98
99 def _raise_linalgerror_nonposdef(err, flag):
LinAlgError: Singular matrix
To reproduce, load the Lalonde dataset (you can write it to a CSV from R after running data(lalonde)) and run the following code:
import numpy as np
import pandas as pd
from statsmodels.formula import api as fsms
filename = 'lalonde.csv'
df = pd.read_csv(filename)
tdf = df.drop(['re74', 're75', 'u74', 'u75'], axis=1)
formula = 'treat ~ 1 + C(age) + C(educ) + C(black) + C(hisp) + C(married) + C(nodegr)'
psmodel = fsms.logit(formula, tdf).fit()
I am not sure why the fit fails to converge and ends up with a singular Hessian matrix.
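One possible explanation, not confirmed anywhere in this thread: with one dummy variable per distinct value of age or educ, some levels may contain only a handful of rows, or rows from a single treatment group. In that case the corresponding coefficient has no finite maximum-likelihood estimate, which can show up as non-convergence and a (near-)singular Hessian. A minimal sketch for spotting such levels follows; the helper name `sparse_levels` and the toy data are made up for illustration and are not the Lalonde data.

```python
import pandas as pd


def sparse_levels(df, column, outcome):
    """Count outcome values per level and flag levels seen with only one outcome value."""
    counts = pd.crosstab(df[column], df[outcome])
    # A zero in any outcome column means that level never occurs with that outcome.
    one_sided = (counts == 0).any(axis=1)
    return counts.assign(one_sided=one_sided)


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "age":   [20, 20, 21, 21, 22, 23, 23, 23],
    "treat": [0,  1,  0,  1,  1,  0,  0,  0],
})

report = sparse_levels(toy, "age", "treat")
print(report)
```

Levels flagged as one-sided would be candidates for merging with neighbouring levels before fitting.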
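Interestingly, some examples I found online on causal inference with the Lalonde dataset do not treat the variables as categorical, which makes no sense to me. One example is Microsoft DoWhy, which uses sklearn's LogisticRegression out of the box. It does not appear to encode the variables as categorical.
There are other similar examples that run logistic regression on the Lalonde dataset without making the variables categorical. The values are stored as numbers, but they should not be treated as continuous. At the very least I feel they should be binned, if not given one category per value. But that is a separate question, better suited to CrossValidated. Can someone help me understand why this error occurs and what the correct way to get rid of it is?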
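The binning idea mentioned above could be sketched with pandas' `pd.cut`. The cut points, labels, and the column name `age_bin` below are hypothetical choices for illustration, not values taken from the question or from any Lalonde analysis.

```python
import pandas as pd

# Toy ages, made up for illustration only.
toy = pd.DataFrame({"age": [17, 22, 25, 31, 38, 46, 55]})

# Hypothetical cut points; intervals are right-closed: (0, 24], (24, 34], ...
edges = [0, 24, 34, 44, 120]
labels = ["<=24", "25-34", "35-44", "45+"]

toy["age_bin"] = pd.cut(toy["age"], bins=edges, labels=labels)
print(toy["age_bin"].value_counts(sort=False))
```

The binned column could then replace C(age) in the formula, for example as C(age_bin), which yields far fewer dummy variables than one per distinct age.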
I ran into this error when running the following logit model.
import statsmodels.formula.api
results = statsmodels.formula.api.logit('binary_outcome ~ x1 + x2 + const', data=df).fit()
print(results.summary())
After checking each variable, I found that one of them was actually a constant.
df.const.value_counts()
1 100000
Name: Targeted, dtype: int64
Oops!
After removing it,
results = statsmodels.formula.api.logit('binary_outcome ~ x1 + x2', data=df).fit()
print(results.summary())
the logit model ran as expected.
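To see why a constant predictor causes this, note that the formula interface adds an intercept column of ones by default, so a column that is always 1 duplicates it and the design matrix loses full column rank. The sketch below uses made-up toy data and builds the matrix by hand with NumPy, roughly as the formula would, to check for this before fitting.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: two real predictors and a constant column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x1": rng.normal(size=50),
    "x2": rng.normal(size=50),
    "const": 1,
})

# Columns that take a single value carry no information beyond the intercept.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]

# Intercept column plus the three predictors.
X = np.column_stack([np.ones(len(toy)), toy[["x1", "x2", "const"]].to_numpy()])
rank = np.linalg.matrix_rank(X)

print("constant columns:", constant_cols)
print("columns:", X.shape[1], "rank:", rank)
```

A rank below the number of columns means at least one column is a linear combination of the others, so the Hessian cannot be inverted.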