Scikit logistic regression summary output?

The*_*ude 8 | tags: python, scikit-learn, statsmodels

Is there a way to get a nice summary output for a scikit logistic regression model, like statsmodels provides? With all the p-values, standard errors, etc. in one table?

sri*_*avi 1

As you and others have pointed out, this is a limitation of scikit-learn. Before discussing a scikit approach to your question below, the "best" option is to use statsmodels, as follows:

    import statsmodels.api as sm
    smlog = sm.Logit(y, sm.add_constant(X)).fit()
    smlog.summary()

Here X represents your input feature/predictor matrix and y the outcome variable. This works best if X lacks highly correlated features, lacks low-variance features, the features don't generate "perfect/quasi-perfect separation", and any categorical features are reduced to "n-1" levels, i.e. dummy-coded (and not "n" levels, i.e. one-hot encoded, as described here: dummy variable trap).
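On the "n-1 levels" point, pandas can produce the dummy coding directly; a minimal sketch (the `color` column here is invented for illustration):

```python
import pandas as pd

# Toy frame with one 3-level categorical feature ("color" is hypothetical)
df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "x1": [1.0, 2.0, 3.0, 4.0]})

# drop_first=True keeps n-1 dummy columns (the first category, sorted
# alphabetically, is dropped), avoiding the dummy variable trap
dummies = pd.get_dummies(df, columns=["color"], drop_first=True)
print(dummies.columns.tolist())  # x1 plus 2 (not 3) dummy columns
```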


However, if the above isn't viable/practical, a scikit approach is coded below to get fairly equivalent results, in terms of feature coefficients/odds along with their standard errors and 95% CI estimates. Essentially, the code generates these results from distinct logistic regression scikit models trained on different test-train splits of your data. Again, make sure your categorical features are dummy-coded to n-1 levels (otherwise your scikit coefficients will be incorrect for the categorical features).

    #Imports
    import random
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    #Instantiate logistic regression model with regularization turned OFF
    #(penalty=None on scikit-learn >= 1.2; use penalty="none" on older versions)
    log_nr = LogisticRegression(fit_intercept=True, penalty=None)

    ##Generate 5 distinct random numbers - as random seeds for 5 test-train splits
    randomlist = random.sample(range(1, 10000), 5)

    ##Create features column
    coeff_table = pd.DataFrame(X.columns, columns=["features"])

    ##Assemble coefficients over logistic regression models on 5 random data splits
    #iterate over random states while keeping track of `i`
    for i, state in enumerate(randomlist):
        train_x, test_x, train_y, test_y = train_test_split(
            X, y, stratify=y, test_size=0.3, random_state=state)  #5 test-train splits
        log_nr.fit(train_x, train_y)  #fit logistic model
        coeff_table[f"coefficients_{i+1}"] = np.transpose(log_nr.coef_)

    ##Calculate mean and std error for model coefficients (from the 5 models above;
    #columns 1:6 are the five coefficient columns, skipping the "features" column)
    coeff_table["mean_coeff"] = coeff_table.iloc[:, 1:6].mean(axis=1)
    coeff_table["se_coeff"] = coeff_table.iloc[:, 1:6].sem(axis=1)

    #Calculate 95% CI intervals for feature coefficients
    coeff_table["95ci_se_coeff"] = 1.96 * coeff_table["se_coeff"]
    coeff_table["coeff_95ci_LL"] = coeff_table["mean_coeff"] - coeff_table["95ci_se_coeff"]
    coeff_table["coeff_95ci_UL"] = coeff_table["mean_coeff"] + coeff_table["95ci_se_coeff"]

Finally (and optionally), convert the coefficients to odds by exponentiating them as below. Odds ratios are my favorite output from logistic regression, and they are appended to your dataframe with the code below.

    #Calculate odds ratios and 95% CI (LL = lower limit, UL = upper limit) intervals for each feature
    coeff_table["odds_mean"] = np.exp(coeff_table["mean_coeff"])
    coeff_table["95ci_odds_LL"] = np.exp(coeff_table["coeff_95ci_LL"])
    coeff_table["95ci_odds_UL"] = np.exp(coeff_table["coeff_95ci_UL"])
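As a quick numeric sanity check of that back-transform (the coefficient value here is invented for illustration): a mean coefficient of ln(2) ≈ 0.693 on the log-odds scale corresponds to an odds ratio of 2, i.e. each unit increase in that feature doubles the odds of the outcome.

```python
import numpy as np

mean_coeff = np.log(2)           # illustrative coefficient on the log-odds scale
odds_ratio = np.exp(mean_coeff)  # exponentiate back to the odds scale
print(round(odds_ratio, 6))      # 2.0
```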

This answer builds on a related response by @pciunkiewicz: Collate model coefficients across multiple test-train splits from sklearn
