Asked by use*_*627 (score 17) · tags: python, linear-regression, scikit-learn
The sklearn.linear_model.LinearRegression method has a parameter fit_intercept=True or fit_intercept=False. I would like to know: if we set it to True, does it add an extra intercept column of all 1s to the dataset? And if my dataset already contains a column of 1s, does fit_intercept=False take that into account, or does it force the model to fit with a zero intercept?
Update: it seems people are not getting my question. The question is basically: suppose my predictor dataset already contains a column of 1s (the 1s being the intercept). Then,
1) If I use fit_intercept=False, does it drop the column of 1s?
2) If I use fit_intercept=True, does it add an EXTRA column of 1s?
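To make concrete what I mean by "a column of 1s being the intercept": in the usual design-matrix formulation the intercept is just the coefficient on an all-ones column. Here is a minimal illustration with plain numpy (the toy data and the use of np.linalg.lstsq are mine, purely for illustration; this is not a claim about what scikit-learn does internally):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)

# Design matrix with an explicit all-ones column: its coefficient is the intercept
X_design = np.column_stack((np.ones_like(x), x))
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # roughly [2.0, 3.0] -> [intercept, slope]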
Answered by Jar*_*rad (score 22)
fit_intercept=False sets the y-intercept to 0. With fit_intercept=True, the y-intercept is determined by the line of best fit.
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Toy data: y = 0.3 * x + 100 plus Gaussian noise
bias = 100
X = np.arange(1000).reshape(-1, 1)
y_true = np.ravel(X.dot(0.3) + bias)
noise = np.random.normal(0, 60, 1000)
y = y_true + noise
# Fit once with and once without an intercept term
lr_fi_true = LinearRegression(fit_intercept=True)
lr_fi_false = LinearRegression(fit_intercept=False)
lr_fi_true.fit(X, y)
lr_fi_false.fit(X, y)

print('Intercept when fit_intercept=True : {:.5f}'.format(lr_fi_true.intercept_))
print('Intercept when fit_intercept=False : {:.5f}'.format(lr_fi_false.intercept_))

# Rebuild the fitted lines by hand from coef_ and intercept_
lr_fi_true_yhat = np.dot(X, lr_fi_true.coef_) + lr_fi_true.intercept_
lr_fi_false_yhat = np.dot(X, lr_fi_false.coef_) + lr_fi_false.intercept_
# Scatter the data, overlay both fitted lines, and draw reference lines
# at x=0, y=0 and y=bias so the intercepts are easy to read off
plt.scatter(X, y, label='Actual points')
plt.plot(X, lr_fi_true_yhat, 'r--', label='fit_intercept=True')
plt.plot(X, lr_fi_false_yhat, 'r-', label='fit_intercept=False')
plt.legend()
plt.vlines(0, 0, y.max())
plt.hlines(bias, X.min(), X.max())
plt.hlines(0, X.min(), X.max())
plt.show()
This example prints:
Intercept when fit_intercept=True : 100.32210
Intercept when fit_intercept=False : 0.00000
Visually, this makes it clear what fit_intercept does. With fit_intercept=True, the best-fit line is allowed to "fit" the y-axis (the intercept comes out close to 100 in this example). With fit_intercept=False, the intercept is forced through the origin (0, 0).
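One way to convince yourself that no hidden column of ones is needed for the fit_intercept=True case: an OLS fit with an intercept always passes through the point of means, so intercept_ must equal y.mean() - X.mean(axis=0) @ coef_ whatever the internals look like. A minimal check on made-up data (the identity is what matters here, not the implementation):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 5.0 + X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=200)

lr = LinearRegression(fit_intercept=True).fit(X, y)

# For OLS with an intercept, the fitted hyperplane passes through the means,
# so the intercept can be recovered from coef_ and the column means of X.
manual_intercept = y.mean() - X.mean(axis=0) @ lr.coef_
print(lr.intercept_, manual_intercept)  # the two values agree to numerical precision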
What happens if I include a column of ones or zeros and set fit_intercept to True or False? The example below shows how to check this.
from sklearn.linear_model import LinearRegression
import numpy as np

np.random.seed(1)

# Same toy data as above: y = 0.3 * x + 100 plus Gaussian noise
bias = 100
X = np.arange(1000).reshape(-1, 1)
y_true = np.ravel(X.dot(0.3) + bias)
noise = np.random.normal(0, 60, 1000)
y = y_true + noise

# Same predictors with an explicit column of ones prepended
X_with_ones = np.hstack((np.ones((X.shape[0], 1)), X))

# Try every combination of fit_intercept and with/without the ones column
for b, data in ((True, X), (False, X), (True, X_with_ones), (False, X_with_ones)):
    lr = LinearRegression(fit_intercept=b)
    lr.fit(data, y)
    print(lr.intercept_, lr.coef_)
Takeaway:
# fit_intercept=True, no column of ones
104.156765787 [ 0.29634031]
# fit_intercept=False, no column of ones
0.0 [ 0.45265361]
# fit_intercept=True, with the column of ones
104.156765787 [ 0. 0.29634031]
# fit_intercept=False, with the column of ones
0.0 [ 104.15676579 0.29634031]
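As a follow-up check on the takeaway: with fit_intercept=False and an explicit column of ones, the coefficient learned for that column plays the role of the intercept, so the fitted line is the same as the fit_intercept=True model trained without the ones column. A quick sketch reusing the same toy data as above:

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(1)
bias = 100
X = np.arange(1000).reshape(-1, 1)
y = np.ravel(X.dot(0.3) + bias) + np.random.normal(0, 60, 1000)
X_with_ones = np.hstack((np.ones((X.shape[0], 1)), X))

lr_true = LinearRegression(fit_intercept=True).fit(X, y)
lr_false_ones = LinearRegression(fit_intercept=False).fit(X_with_ones, y)

# The coefficient on the ones column matches intercept_, and both models
# produce the same predictions.
print(np.allclose(lr_true.intercept_, lr_false_ones.coef_[0]))               # True
print(np.allclose(lr_true.predict(X), lr_false_ones.predict(X_with_ones)))   # True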