Scikit-Learn: one-hot encoding before or after the train/test split

B_M*_*ner 10 python-2.7 scikit-learn

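I am looking at two scenarios for building a model with scikit-learn, and I cannot figure out why one returns a result so fundamentally different from the other. The only difference between the two cases (that I know of) is that in one case I one-hot encode the categorical variables all at once (on the whole data) and then split into training and test sets. In the second case I split into training and test sets first and then one-hot encode both sets based on the training data.

The latter case is technically better for judging the generalization error of the process, but it returns a normalized gini that is dramatically different from the first case (and bad: essentially no model). I know the first case's gini (~0.33) is in line with a model built on this data.

Why does the second case return such a different gini? FYI, the data set contains a mix of numeric and categorical variables.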
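For reference, the second scenario (fit the encoder on the training rows only, then apply it to the test rows) can be sketched on toy data. The rows and feature names below are made up for illustration and are not from the Liberty data set; the sketch only shows what `DictVectorizer` does with a category it never saw during `fit`.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy rows: one categorical and one numeric feature.
train_rows = [{"color": "red", "size": 1.0}, {"color": "blue", "size": 2.0}]
test_rows = [{"color": "green", "size": 3.0}, {"color": "red", "size": 4.0}]

vec = DictVectorizer(sparse=False)
train_X = vec.fit_transform(train_rows)  # learn the mapping from training rows only
test_X = vec.transform(test_rows)        # reuse that mapping on the test rows

# Column names in column order, recovered from the learned vocabulary.
feature_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(feature_names)
print(test_X)
```

The unseen category `green` produces no column of its own, so its row is all zeros in the `color=...` columns.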

Method 1 (one-hot encode the entire data set, then split) returns: Validation Sample Score: 0.3454355044 (normalized gini).

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

def gini(solution, submission):
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1],-x[2]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1,len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    return normalized_gini

# Normalized Gini Scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better = True)



if __name__ == '__main__':

    dat=pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv',sep=",")
    y=dat[['Hazard']].values.ravel()
    dat=dat.drop(['Hazard','Id'],axis=1)


    folds=train_test_split(range(len(y)),test_size=0.30, random_state=15) #30% test

    #First one hot and make a pandas df
    dat_dict=dat.T.to_dict().values()
    vectorizer = DV( sparse = False )
    vectorizer.fit( dat_dict )
    dat= vectorizer.transform( dat_dict )
    dat=pd.DataFrame(dat)


    train_X=dat.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=dat.iloc[folds[1],:]
    test_y=y[folds[1]]


    rf=RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y,y_submission)))

Method 2 (split first, then one-hot encode) returns: Validation Sample Score: 0.0055124452 (normalized gini).

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

def gini(solution, submission):
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1],-x[2]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1,len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    return normalized_gini

# Normalized Gini Scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better = True)



if __name__ == '__main__':

    dat=pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv',sep=",")
    y=dat[['Hazard']].values.ravel()
    dat=dat.drop(['Hazard','Id'],axis=1)


    folds=train_test_split(range(len(y)),test_size=0.3, random_state=15) #30% test

    #first split
    train_X=dat.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=dat.iloc[folds[1],:]
    test_y=y[folds[1]]

    #One hot encode the training X and transform the test X
    dat_dict=train_X.T.to_dict().values()
    vectorizer = DV( sparse = False )
    vectorizer.fit( dat_dict )
    train_X= vectorizer.transform( dat_dict )
    train_X=pd.DataFrame(train_X)

    dat_dict=test_X.T.to_dict().values()
    test_X= vectorizer.transform( dat_dict )
    test_X=pd.DataFrame(test_X)


    rf=RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y,y_submission)))

inv*_*ion 12

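While the earlier comments correctly suggest that it is best to map your entire feature space first, in your case both Train and Test contain all of the feature values in all of the columns.

If you compare `vectorizer.vocabulary_` between the two versions, they are exactly the same, so there is no difference in the mapping. Hence, it cannot be causing the problem.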
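The vocabulary comparison described above can be sketched as follows. The rows are hypothetical toy data; the point is only that when the training subset already contains every category, fitting on the subset and fitting on everything yield the same `vocabulary_`.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy rows; the first three play the role of the training split.
all_rows = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 2.0},
    {"color": "red", "size": 3.0},
    {"color": "blue", "size": 4.0},
]
train_rows = all_rows[:3]

vec_all = DictVectorizer(sparse=False).fit(all_rows)      # Method 1 style
vec_train = DictVectorizer(sparse=False).fit(train_rows)  # Method 2 style

# Identical dicts mean identical feature-to-column mappings.
same_mapping = vec_all.vocabulary_ == vec_train.vocabulary_
print(same_mapping)
```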

The reason Method 2 fails is that your `dat_dict` gets re-sorted by the original index when you execute this command:

dat_dict=train_X.T.to_dict().values()

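In other words, `train_X` has a shuffled index going into this line of code. When you convert it to a `dict`, the `dict` order is re-sorted into the numerical order of the original index, while `train_y` and `test_y` stay in the shuffled order. This leaves the rows of your Train and Test data completely decorrelated from `y`.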
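The misalignment can be reproduced on a toy frame. One assumption to flag: on Python 3.7+ a `dict` keeps insertion order, so the reordering described here would not happen by itself. The sketch below therefore emulates it explicitly by iterating over the sorted index keys; the data and the `train_idx` values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: row i has x = 10 + i and target y = 100 + i.
dat = pd.DataFrame({"x": [10, 11, 12, 13, 14]})
y = [100, 101, 102, 103, 104]

train_idx = [3, 0, 4]                # shuffled row positions, as after a split
train_X = dat.iloc[train_idx, :]     # keeps the original index labels 3, 0, 4
train_y = [y[i] for i in train_idx]  # stays in shuffled order

by_index = train_X.T.to_dict()       # keyed by the original index labels

# Emulate the numeric-order iteration described in the answer.
reordered = [by_index[k] for k in sorted(by_index)]

pairs = [(row["x"], target) for row, target in zip(reordered, train_y)]
print(pairs)  # each x is now paired with the wrong target
```

Here the row with `x = 10` ends up next to the target `103`, which belongs to the row with `x = 13`.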

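Method 1 does not suffer from this problem, because you shuffle the data after the mapping.

You can fix the issue by adding a `.reset_index()` both times you assign `dat_dict` in Method 2, e.g.,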

dat_dict=train_X.reset_index(drop=True).T.to_dict().values()

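This ensures the data order is preserved when converting to a `dict`.

With that change, I get the following results:
- Method 1: Validation Sample Score: 0.3454355044 (normalized gini)
- Method 2: Validation Sample Score: 0.3438430991 (normalized gini)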
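The effect of the `reset_index(drop=True)` fix can be checked on a toy frame. As before, the sorted-key iteration stands in for the numeric-order iteration described above. The `to_dict(orient="records")` variant at the end is not part of the original answer; it is offered as one possible alternative, since it returns a list of row dicts in row order and does not depend on index labels at all.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
dat = pd.DataFrame({"x": [10, 11, 12, 13, 14], "cat": list("abcde")})
train_X = dat.iloc[[3, 0, 4], :]  # shuffled subset with index labels 3, 0, 4

# The fix from the answer: relabel rows 0..n-1 in their current order first.
fixed = train_X.reset_index(drop=True).T.to_dict()
rows_fixed = [fixed[k] for k in sorted(fixed)]  # numeric key order == row order

# Alternative (an assumption, not from the answer): one dict per row, in row order.
rows_records = train_X.to_dict(orient="records")

print([row["x"] for row in rows_fixed])
print([row["x"] for row in rows_records])
```

Both lists keep the shuffled row order, so they stay aligned with a `train_y` taken from the same split.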

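  • It was pointed out that `reset_index` should include `drop=True` so that the dictionary does not contain the index information. I updated the code to reflect this, although it does not change the result. Without `drop=True`, `dat_dict` does contain the index information, but it is not mapped to anything because there is no column named index in `train_X`. (2 upvotes)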