Reproducing LASSO / logistic regression results from R in Python using the Iris dataset

Question

Oli*_*lil 5 python r lasso-regression scikit-learn logistic-regression

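I am trying to reproduce the following R results in Python. In this particular case the R model's predictive skill is lower than the Python model's, but in my experience that is usually not so (hence the wish to reproduce the R results in Python), so please ignore that detail here.

The aim is to predict the flower species ('versicolor' 0 or 'virginica' 1). We have 100 labelled samples, each with 4 flower features: sepal length, sepal width, petal length, petal width. I split the data into a training set (60% of the data) and a test set (40% of the data). 10-fold cross-validation is applied to the training set to search for the optimal lambda (the parameter optimized in scikit-learn is "C").

In R I use glmnet with alpha set to 1 (for the LASSO penalty); in Python I use scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with the L1 penalty). The scoring metric used in cross-validation is the same in both languages. However, the model results differ (the intercept and the coefficients found for each feature are quite different).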

R code

library(glmnet)
library(datasets)
data(iris)

y <- as.numeric(iris[,5])
X <- iris[y!=1, 1:4]
y <- y[y!=1]-2

n_sample = NROW(X)

w = .6
X_train = X[1:(w * n_sample),]  # (60, 4); R indexing starts at 1
y_train = y[1:(w * n_sample)]   # (60,)
X_test = X[((w * n_sample)+1):n_sample,]  # (40, 4)
y_test = y[((w * n_sample)+1):n_sample]   # (40,)

# set alpha=1 for LASSO and alpha=0 for ridge regression
# use class for logistic regression
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
                        nfolds = 10, alpha=1, family="binomial", type.measure="class")

best_s  <- model_lambda$lambda.1se
pred <- as.numeric(predict(model_lambda, newx=as.matrix(X_test), type="class" , s=best_s))

# best lambda
print(best_s)
# 0.04136537

# fraction correct
print(sum(y_test==pred)/NROW(pred))   
# 0.75

# model coefficients
print(coef(model_lambda, s=best_s))
#(Intercept)  -14.680479
#Sepal.Length   0        
#Sepal.Width   0
#Petal.Length   1.181747
#Petal.Width    4.592025

Python code

from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0]  # four features. Disregard one of the 3 species.                                                                                                                 
y = y[y != 0]-1  # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.                                                                               

n_sample = len(X)

w = .6
X_train = X[:int(w * n_sample)]  # (60, 4)
y_train = y[:int(w * n_sample)]  # (60,)
X_test = X[int(w * n_sample):]  # (40, 4)
y_test = y[int(w * n_sample):]  # (40,)

X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)

clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring='accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)

print(clf.score(X_train_fit.transform(X_test), y_test))  # score is 0.775
print(clf.intercept_)  # -1.83569557
print(clf.coef_)  # [0, 0, 0.65930981, 1.17808155] (sepal length, sepal width, petal length, petal width)
print(clf.C_)  # optimal C (inverse of the regularization strength): 0.35938137

Answer 1

Cra*_*aig 4

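There are a few things that differ between the examples above:

Scale of the coefficients
- glmnet ( https://cran.r-project.org/web/packages/glmnet/glmnet.pdf ) standardizes the data and "the coefficients are always returned on the original scale". That is why you did not have to scale your data before calling glmnet.
- The Python code standardizes the data and then fits to the standardized data. The coefficients in this case are on the standardized scale, not the original scale. This makes the coefficients of the two examples non-comparable.
LogisticRegressionCV uses stratified folds by default. glmnet uses plain k-fold.
They fit different equations. Note that scikit-learn's logistic regression ( http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression ) puts the regularization parameter on the logistic (loss) term, whereas glmnet puts it on the penalty term.
Choice of regularization strengths to try: glmnet tries 100 lambdas by default, while scikit-learn's LogisticRegressionCV defaults to 10. Because of the equation scikit-learn solves, the range is between 1e-4 and 1e4 ( http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV ).
The tolerances differ. In some problems I have had, tightening the tolerance changed the coefficients significantly.
- glmnet's default thresh is 1e-7
- LogisticRegressionCV's default tol is 1e-4
- Even after making them the same, they may not measure the same thing. I do not know what liblinear measures. glmnet: "Each inner coordinate-descent loop continues until the maximum change in the objective after any coefficient update is less than thresh times the null deviance."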
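
To make the first point concrete, here is a minimal sketch of how coefficients fitted on standardized features could be mapped back to the original feature scale before comparing them with glmnet's output. The helper names `unscale_coefficients` and `linear_predictor` and all numbers are hypothetical, chosen only for illustration; they are not taken from the fits in the question.

```python
def unscale_coefficients(coef_std, intercept_std, means, scales):
    """Map coefficients fitted on z = (x - mean) / scale back to the raw x scale."""
    # Each slope is divided by the feature's scale ...
    coef_raw = [b / s for b, s in zip(coef_std, scales)]
    # ... and the shift by the feature means is absorbed into the intercept.
    intercept_raw = intercept_std - sum(
        b * m / s for b, m, s in zip(coef_std, means, scales)
    )
    return coef_raw, intercept_raw


def linear_predictor(x, coef, intercept):
    """Log-odds of the positive class for one sample."""
    return intercept + sum(b * v for b, v in zip(coef, x))


# Hypothetical per-feature means, scales and standardized-scale coefficients.
means = [6.0, 2.8, 4.5, 1.5]
scales = [0.5, 0.3, 0.6, 0.3]
coef_std = [0.0, 0.0, 0.66, 1.18]
intercept_std = -1.84

coef_raw, intercept_raw = unscale_coefficients(coef_std, intercept_std, means, scales)

# The same flower, expressed on the raw scale (x) and the standardized scale (z).
x = [6.3, 2.9, 5.0, 1.8]
z = [(v - m) / s for v, m, s in zip(x, means, scales)]

# Both parameterizations describe the same model, so the log-odds agree.
print(linear_predictor(z, coef_std, intercept_std))
print(linear_predictor(x, coef_raw, intercept_raw))
```

With the objects from the question, the inputs would presumably be `clf.coef_[0]`, `clf.intercept_[0]`, `X_train_fit.mean_` and `X_train_fit.scale_`. This only removes the scale difference; the other points in the list can still leave the two fits apart.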

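You may want to try printing the regularization paths to see whether they are very similar and merely stop at different strengths. Then you can look into why.

Even after changing what you can change (which is not everything above), you may not get the same coefficients or results. Although you are solving the same problem in different software, the way the software solves it can differ. We see different scales, different equations, different defaults, different solvers, and so on.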
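
Comparing paths requires putting both on the same regularization axis. The sketch below assumes, from my reading of the two packages' documentation, that scikit-learn's L1 objective is ||w||_1 + C * sum(loss) while glmnet's is (1/N) * sum(loss) + lambda * ||beta||_1, which gives lambda of roughly 1 / (C * N). It ignores glmnet's internal standardization and any difference in how the intercept is treated, so treat the mapping as approximate. The function names are hypothetical.

```python
def c_to_lambda(C, n_samples):
    """Approximate glmnet-style lambda for a scikit-learn C (assumed objectives above)."""
    return 1.0 / (C * n_samples)


def lambda_to_c(lam, n_samples):
    """Inverse mapping: approximate scikit-learn C for a glmnet-style lambda."""
    return 1.0 / (lam * n_samples)


# Default LogisticRegressionCV grid: 10 values log-spaced between 1e-4 and 1e4.
default_Cs = [10 ** (-4 + 8 * i / 9) for i in range(10)]

n_train = 60  # size of the training set in the question

for C in default_Cs:
    print(f"C = {C:10.4g}  ->  lambda ~ {c_to_lambda(C, n_train):10.4g}")
```

Note that during cross-validation each fit sees fewer than 60 samples, so the effective lambda for a given C shifts slightly from fold to fold.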
