Merging results from model.predict() with the original pandas DataFrame?

bla*_*ite 12 python pandas scikit-learn

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

# plain ndarray instead of np.matrix, which recent scikit-learn versions reject
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats

But this raises:

ValueError: Length of values does not match length of index

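I know I could split df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in X_test and y_test is lost? Or will I be relegated to splitting the dataframe into train and test first, and then building the feature matrices? I'd like to just fill the rows that were included in train with np.nan values in the dataframe.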

fly*_*all 17

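Your y_hats only has the length of the test data (20%) because you predicted on X_test. Once your model is validated and you are happy with the test predictions (by comparing the model's predictions on X_test against the true values in y_test), you should rerun the prediction on the full dataset (X). Add these two lines to the bottom: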

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2
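For reference, here is a minimal sketch of that "validate first, then predict on everything" flow. It assumes a recent scikit-learn (where train_test_split lives in sklearn.model_selection); accuracy_score, random_state=0 and the name test_accuracy are my own additions for illustration and are not part of the original post.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

# Plain NumPy arrays for the features and the outcome
X = df[[0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# random_state is an illustrative assumption, used only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 1: validate on the held-out rows only
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Step 2: predict on every row, so the result has one value per row of df
df['y_hats'] = model.predict(X)
```

Keep in mind that the predictions for rows the model was trained on are in-sample, so they will generally look better than the held-out accuracy suggests.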

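Edit, per your comment: here is an updated version that returns the dataset with the predictions attached to the rows that were in the test set.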

from sklearn.datasets import load_iris
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
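If you need to keep building X and y as arrays before splitting, as described in the question, one possible alternative is to pass the row labels through train_test_split as a third array so they are shuffled together with X and y. This is only a sketch: idx_train, idx_test and random_state=0 are names and settings I made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

X = df[[0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# Split the row labels along with X and y so all three stay aligned
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Start with NaN everywhere, then fill in only the test rows
df['y_hats'] = np.nan
df.loc[idx_test, 'y_hats'] = model.predict(X_test)
```

Rows that went into training keep np.nan in y_hats, which is what the question asked for.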

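  • @flyingmeatball Hi, I am trying to do exactly the same thing, but when you store y_hats as a variable it becomes a numpy array rather than a DataFrame, so it needs to be converted to pandas before merging. At that point, the merge on the index cannot be done. I am not sure what I am missing? (2 upvotes)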
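Regarding the comment above: predict() does return a bare NumPy array, but the row labels are still available on X_test as long as a DataFrame was passed into the split. A minimal sketch, assuming that setup, is to wrap the array in a pandas Series that reuses X_test.index. The names target, preds and df_out, and random_state=0, are illustrative choices on my part.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
target = pd.Series(data.target, name='class')

# Splitting pandas objects keeps the original row labels on each piece
X_train, X_test, y_train, y_test = train_test_split(
    df, target, train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# predict() returns a NumPy array; reattach the labels of the test rows
preds = pd.Series(model.predict(X_test), index=X_test.index, name='preds')

# join aligns on the index, leaving NaN for rows that were not in the test set
df_out = df.join(preds)
```

Because the Series carries the test rows' index, the join (or an equivalent index-based merge) can line the predictions up with df without any manual bookkeeping.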