huo*_*dui 17 python linear-regression scikit-learn
I want to get a confidence interval for the result of a linear regression. I am using the Boston house-price dataset.
I found this question: How to calculate the 99% confidence interval of the slope in a linear regression model in python? However, it does not quite answer my question.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# import the data
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']
# splits the training and test data set in 80% : 20%
# assigning a fixed random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
# model evaluation for training set
y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(Y_train, y_train_predict)))
r2 = r2_score(Y_train, y_train_predict)
# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
# root mean square error of the model
rmse = (np.sqrt(mean_squared_error(Y_test, y_test_predict)))
# r-squared score of the model
r2 = r2_score(Y_test, y_test_predict)
plt.scatter(Y_test, y_test_predict)
plt.show()
How can I get, for example, a 95% or 99% confidence interval from this? Is there some built-in function or piece of code for it?
cot*_*ail 10
If you want to compute the confidence intervals of the regression parameters, one way is to compute them manually from the results of scikit-learn's LinearRegression using numpy.
The code below computes the 95% confidence interval (alpha=0.05); alpha=0.01 would compute the 99% confidence interval, and so on.
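For reference, the snippet below implements the standard two-sided OLS confidence interval for each coefficient:

coef_i ± t(alpha/2, n - p) * sqrt(MSE * [(X'X)^-1]_ii)

where n is the number of observations, p is the number of parameters including the intercept, and t(alpha/2, n - p) is the upper alpha/2 critical value of Student's t-distribution.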
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
alpha = 0.05 # for 95% confidence interval; use 0.01 for 99%-CI.
# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)
# stack the intercept and the slope coefficients into a single array
coefs = np.r_[[lin_model.intercept_], lin_model.coef_]
# build an auxiliary dataframe with the constant term in it
X_aux = X_train.copy()
X_aux.insert(0, 'const', 1)
# residual degrees of freedom: n_observations - n_parameters (incl. intercept)
dof = -np.diff(X_aux.shape)[0]
# Student's t-distribution table lookup
t_val = stats.t.isf(alpha/2, dof)
# MSE of the residuals
mse = np.sum((Y_train - lin_model.predict(X_train)) ** 2) / dof
# diagonal of (X'X)^-1; scaled by the MSE it gives the variance of the estimates
var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
# distance between lower and upper bound of CI
gap = t_val * np.sqrt(mse * var_params)
conf_int = pd.DataFrame({'lower': coefs - gap, 'upper': coefs + gap}, index=X_aux.columns)
Using the Boston housing dataset, the above code produces a dataframe of lower and upper bounds for the intercept (const) and the LSTAT and RM coefficients.
If that is too much manual code, you can always resort to statsmodels and use its conf_int method:
import statsmodels.api as sm
alpha = 0.05 # 95% confidence interval
lr = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
conf_interval = lr.conf_int(alpha)
Since it uses the same formula, it produces the same output as the code above.
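To convince yourself the two approaches agree, a quick check might look like this (a minimal sketch, assuming conf_int from the manual computation and lr from the statsmodels snippet are still in scope):

import numpy as np

# statsmodels returns the bounds as a DataFrame with columns 0 (lower) and
# 1 (upper), indexed by 'const', 'LSTAT' and 'RM' -- the same row order
# as the manual result
sm_ci = lr.conf_int(alpha)

# compare the bounds elementwise; should print True
print(np.allclose(conf_int.values, sm_ci.values))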
A convenient wrapper function:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
def get_conf_int(alpha, lr, X, y):
    """
    Returns (1 - alpha) two-sided confidence intervals
    for the coefficients of a fitted sklearn LinearRegression
    as a pandas DataFrame.
    """
    coefs = np.r_[[lr.intercept_], lr.coef_]
    # auxiliary dataframe with the constant term prepended
    X_aux = X.copy()
    X_aux.insert(0, 'const', 1)
    # residual degrees of freedom: n_observations - n_parameters
    dof = -np.diff(X_aux.shape)[0]
    mse = np.sum((y - lr.predict(X)) ** 2) / dof
    var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
    t_val = stats.t.isf(alpha/2, dof)
    gap = t_val * np.sqrt(mse * var_params)
    return pd.DataFrame({
        'lower': coefs - gap, 'upper': coefs + gap
    }, index=X_aux.columns)
# for 95% confidence interval; use 0.01 for 99%-CI.
alpha = 0.05
# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)
get_conf_int(alpha, lin_model, X_train, Y_train)
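For a 99% interval, just pass a smaller alpha to the same wrapper:

get_conf_int(0.01, lin_model, X_train, Y_train)  # 99%-CI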