Negative accuracy in linear regression

Asked by Abe*_*ibe (1 vote) · python, machine-learning, linear-regression, scikit-learn

My linear regression model has a negative coefficient of determination, R².

How can this happen? Any ideas would be helpful.

Here is my dataset:

year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0

The code for the LinearRegression model is as follows:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("data.csv", header=None)
data = data.drop(0, axis=0)  # drop the header row
X = data[0]
Y = data[1]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, shuffle=False)

lm = LinearRegression()
lm.fit(X_train.values.reshape(-1, 1), Y_train.values.reshape(-1, 1))
Y_pred = lm.predict(X_test.values.reshape(-1, 1))

accuracy = lm.score(Y_test.values.reshape(-1, 1), Y_pred)
print(accuracy)
Output:

-3592622948027972.5

Answer by jos*_*ure (5 votes):

Here is the formula for the R² score:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Here $\hat{y}_i$ is the predicted value of the i-th observation $y_i$, and $\bar{y}$ is the mean of all observed values.

A negative R² therefore means that someone who simply knew the mean of your y_test sample and always used it as the "prediction" would be more accurate than your model.
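To see this concretely, here is a minimal sketch (not part of the original answer) using sklearn.metrics.r2_score: a constant prediction equal to the mean of the targets scores exactly R² = 0, so any negative score is worse than that trivial baseline.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])         # toy target values
baseline = np.full_like(y_true, y_true.mean())  # always predict the mean

print(r2_score(y_true, baseline))               # 0.0 -- the trivial baseline
print(r2_score(y_true, [9.0, 7.0, 5.0, 3.0]))   # -3.0 -- worse than the mean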

Moving on to your dataset (thanks to @Prayson W. Daniel for the handy loading script), let's take a quick look at the data.
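The loading script itself is not reproduced here; a minimal equivalent (an assumption, matching the CSV shown in the question) would be:

import pandas as pd

# The CSV above already contains a 'year,population' header row,
# so pandas' default header handling is enough.
df = pd.read_csv("data.csv")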

df.population.plot()

[Plot: population by year]

The population grows roughly exponentially, so it looks like a logarithmic transformation would help.

import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()

[Plot: logarithm of the population]

Now let us perform the linear regression with OpenTURNS.

import openturns as ot
sam = ot.Sample(np.array(df_log)) # convert DataFrame to openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)

Output:

Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469

[Plot: linear regression on the logarithm of the population]

This is an almost perfect fit.
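For comparison, since the question used scikit-learn, here is a rough sketch of the same log-scale fit with LinearRegression (an equivalent reformulation, not the code from the original answer):

import numpy as np
from sklearn.linear_model import LinearRegression

X = df[["year"]].values              # 2-D feature matrix, as sklearn expects
y = np.log(df["population"].values)  # fit in log space

lm = LinearRegression().fit(X, y)
print("Best fitting line = {} + year * {}".format(lm.intercept_, lm.coef_[0]))
print("R2 score = {}".format(lm.score(X, y)))  # note: score takes (X, y), not (y_test, y_pred)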

EDIT

As @Prayson W. Daniel suggested, here is the model fit transformed back to the original scale.

# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)

# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam) # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0])) # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1]) # compute error
r2 = 1.0 - (error**2).mean() / df.population.var() # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))

# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph

Output:

R2 score in original scale = 0.9979032805107133

[Plot: model fit in the original scale]