Ana*_*ngh 4 python machine-learning random-forest scikit-learn
For the code below, my R-squared score is negative, but my accuracy score using k-fold cross-validation is 92%. How is this possible? I am using the random forest regression algorithm to predict some data. The dataset is available at the following link: https://www.kaggle.com/ludobenistant/hr-analytics
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:,:-1].values ##Independent variable
y = dataset.iloc[:,9].values ##Dependent variable
##Encoding the categorical variables
le_x1 = LabelEncoder()
x[:,7] = le_x1.fit_transform(x[:,7])
le_x2 = LabelEncoder()
x[:,8] = le_x2.fit_transform(x[:,8])
ohe = OneHotEncoder(categorical_features = [7,8])
x = ohe.fit_transform(x).toarray()
##splitting the dataset in training and testing data
from sklearn.cross_validation import train_test_split
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
print(y_pred)
from sklearn.metrics import r2_score
r2_score(y_test , y_pred)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
des*_*aut 11
There are several issues with your question...
For starters, you are making a very basic mistake: you think you are using accuracy as a metric, while in a regression setting, such as yours, the metric actually used is the mean squared error (MSE).
Accuracy is a metric used in classification, and it has to do with the percentage of correctly classified examples - check the Wikipedia entry for more details.
The metric used internally by your chosen regressor (random forest) is included in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:
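To see concretely why the two metrics behave so differently, here is a minimal sketch with made-up toy labels (not your data): accuracy only counts exact matches, while R-squared compares squared residuals to the variance of the true values, so the very same predictions can score high on one and low on the other:

```python
from sklearn.metrics import accuracy_score, r2_score

# Toy binary labels and predictions (made-up values, for illustration only)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Accuracy: fraction of exactly matching labels -- a classification metric
print(accuracy_score(y_true, y_pred))  # 0.8

# R-squared: 1 - SS_res/SS_tot -- a regression metric; here only ~0.167
print(r2_score(y_true, y_pred))
```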
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
verbose=0, warm_start=False)
MSE is a positive, continuous quantity, and it is not bounded above by 1, i.e. if you got a value of 0.92, this would mean... well, 0.92, and not 92%.
Knowing this, it is good practice to include MSE explicitly as the scoring function of your cross-validation:
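As a quick sanity check of what MSE actually is - just the mean of the squared residuals, on the (squared) scale of the target, not a percentage - here it is computed by hand on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values, for illustration only
y_true = np.array([3.0, 0.0, 2.0])
y_pred = np.array([2.5, 0.5, 2.0])

# Mean of squared residuals: (0.25 + 0.25 + 0.0) / 3
mse_manual = np.mean((y_true - y_pred) ** 2)
print(mse_manual)                          # ~0.1667
print(mean_squared_error(y_true, y_pred))  # same value
```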
cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error')
cv_mse.mean()
# -2.433430574463703e-28
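A note on the sign: scikit-learn scorers follow a "greater is better" convention, which is why the scoring string is neg_mean_squared_error and the values come back negated; negate them again to recover the actual MSE. A small self-contained sketch on synthetic data (not your HR dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, just to demonstrate the sign convention
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
reg = RandomForestRegressor(n_estimators=10, random_state=0)

# Scores come back as negative MSE ("greater is better" convention)
neg_mse = cross_val_score(reg, X, y, cv=5, scoring='neg_mean_squared_error')
mse = -neg_mse  # negate to recover the actual (non-negative) MSE per fold
print(mse.mean())
```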
For all practical purposes, this is zero - you fit your training set almost perfectly; for confirmation, here is the (again perfect) R-squared score on your training set:
train_pred = regressor.predict(x_train)
r2_score(y_train , train_pred)
# 1.0
But, as always, the moment of truth comes when you apply your model to the test set; your second mistake here is that, since you trained your regressor on scaled y_train, you should also scale y_test before evaluating:
y_test = sc_y.transform(y_test)  ## use the scaler already fitted on y_train
r2_score(y_test , y_pred)
# 0.9998476914664215
So you get a very good R-squared on the test set as well (close to 1).
What about the MSE?
from sklearn.metrics import mean_squared_error
mse_test = mean_squared_error(y_test, y_pred)
mse_test
# 0.00015230853357849051
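As a final note, the manual scaling of y_train and y_test above is easy to get wrong (e.g. calling fit_transform instead of transform on the test set). In newer scikit-learn versions (0.20+), TransformedTargetRegressor combined with a Pipeline can handle both feature and target scaling for you, fitting the scalers on the training data only and inverting the target transform at prediction time. A sketch on synthetic data (the data and names are illustrative, not from the question):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.randn(300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline scales X using statistics from the training split only;
# TransformedTargetRegressor scales y at fit time and automatically
# inverts the transform at predict time, so y_test needs no manual scaling
model = TransformedTargetRegressor(
    regressor=make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=10, random_state=0),
    ),
    transformer=StandardScaler(),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R-squared on the original y scale
```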