Adding RandomForestClassifier predict_proba results to the original DataFrame

Question


Pyt*_*_DK 6 python dataframe python-3.x pandas random-forest

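I'm a newbie working on my first "real" machine learning algorithm. Apologies if this is a duplicate, but I couldn't find an answer on SO.

I have the following DataFrame (df):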

index    Feature1  Feature2  Feature3  Target
001       01         01        03        0
002       03         03        01        1
003       03         02        02        1


My code looks like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size = 0.8)

clf = RandomForestClassifier().fit(X_train, y_train)

prediction_of_probability = clf.predict_proba(X_test)


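What I'm struggling with is how to get 'prediction_of_probability' back into the DataFrame df.

I understand that the predictions will not cover every row in the original DataFrame.

Thanks in advance for helping a newbie like me!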

Answer 1

Joe*_*Joe 6

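What you have done is train the model. That is, using the features and labels you have, you train a model to be applied to future data. To check the quality of the model (for example, when selecting features), the model is evaluated on X_test and y_test. In this case you have no future data, so you are not applying the model, only training it. You can assess the quality of the model with the ROC curve or the AUC.

In any case, you can attach the results to a DataFrame this way: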
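As a rough illustration of the quality check mentioned above, the sketch below computes the test-set AUC with scikit-learn's roc_auc_score. The 200-row synthetic DataFrame, the rule that generates Target, and the names pos_col, prob_1 and auc are assumptions made for this example; they are not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: same column names as the question, 200 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 4, size=(200, 3)),
                  columns=['Feature1', 'Feature2', 'Feature3'])
df['Target'] = (df['Feature1'] + df['Feature2'] > 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[['Feature1', 'Feature2', 'Feature3']], df['Target'],
    test_size=0.3, random_state=0, stratify=df['Target'])

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# The column order of predict_proba follows clf.classes_, so look up class 1.
pos_col = list(clf.classes_).index(1)
prob_1 = clf.predict_proba(X_test)[:, pos_col]

# AUC of the class-1 probabilities against the held-out labels.
auc = roc_auc_score(y_test, prob_1)
print(f"Test AUC: {auc:.3f}")
```

The AUC is computed on the held-out rows only, since scoring the training rows would overstate the model's quality.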

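# X_test keeps the row index of df; copy it so new columns can be added safely
df_test = X_test.copy()
df_test['Target'] = y_test
# predict_proba columns follow clf.classes_ (here: class 0, then class 1)
df_test['prob_0'] = prediction_of_probability[:,0]
df_test['prob_1'] = prediction_of_probability[:,1]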

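If the goal is to get the probabilities onto the original df rather than onto a separate test frame, one possible approach is to give the probabilities the same index as X_test and join on it. This is a minimal sketch, not the answerer's code: the synthetic data and the names proba_df and df_with_proba are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the question's column names.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(1, 4, size=(100, 3)),
                  columns=['Feature1', 'Feature2', 'Feature3'])
df['Target'] = (df['Feature1'] >= 2).astype(int)

data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, random_state=1, stratify=labels)

clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# One column per class, indexed like X_test so pandas can align the rows.
proba_df = pd.DataFrame(proba, index=X_test.index,
                        columns=[f'prob_{c}' for c in clf.classes_])

# Left join on the index: rows that were used for training get NaN.
df_with_proba = df.join(proba_df)
print(df_with_proba.head())
```

Because the join aligns on the index, the shuffled row order of X_test does not matter.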

Answer 2

Mab*_*lba 5

You can try keeping the indices of the train and test sets and then put everything back together this way:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
indices = df.index.values 

# Split the indices instead of the labels so the row order of the split is kept.

X_train, X_test, indices_train, indices_test = train_test_split(data, indices, test_size=0.33, random_state=42)

# Look the labels up by index label so they line up with X_train and X_test.
y_train, y_test = labels.loc[indices_train], labels.loc[indices_test]


clf = RandomForestClassifier().fit(X_train, y_train)

prediction_of_probability = clf.predict_proba(X_test)

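As a side note to the split above, train_test_split keeps the pandas index on the frames it returns, so X_test.index should carry the same information as indices_test. The small check below is a sketch under that assumption; the 10-row frame and the names same_test and same_train are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame with string index labels like those in the question.
df = pd.DataFrame({'Feature1': range(10), 'Target': [0, 1] * 5},
                  index=[f'{i:03d}' for i in range(1, 11)])
indices = df.index.values

X_train, X_test, indices_train, indices_test = train_test_split(
    df[['Feature1']], indices, test_size=0.33, random_state=42)

# Compare the index carried by each split with the separately split indices.
same_test = np.array_equal(X_test.index.values, indices_test)
same_train = np.array_equal(X_train.index.values, indices_train)
print(same_test, same_train)
```

If both checks hold, X_test.index and X_train.index can be used with .loc in place of the separately tracked arrays.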

Then you can put the probabilities into a new DataFrame, df_new:

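>>> df_new = df.copy()
>>> # predict_proba returns one column per class in clf.classes_. Here the training
>>> # rows all have Target 1, so there is a single column; with two classes, select
>>> # one column first, e.g. prediction_of_probability[:, 1].
>>> df_new.loc[indices_test,'pred_test'] = prediction_of_probability # clf.predict_proba(X_test)
>>> print(df_new)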

   Feature1  Feature2  Feature3  Target  pred_test
1         3         3         1       1        NaN
2         3         2         2       1        NaN
0         1         1         3       0        1.0


Or even the predictions for the training set:

>>> df_new.loc[indices_train,'pred_train'] = clf.predict_proba(X_train)
>>> print(df_new)

   Feature1  Feature2  Feature3  Target  pred_test  pred_train
1         3         3         1       1        NaN         1.0
2         3         2         2       1        NaN         1.0
0         1         1         3       0        1.0         NaN


Or, if you want to combine the train and test probabilities, just use the same column name (e.g. pred) for both.
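The single-column variant could look roughly like the sketch below. It assumes a data set in which both classes appear in the training rows, so predict_proba has two columns; the synthetic data, the split column, and the name pos_col are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the question's column names.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.integers(1, 4, size=(90, 3)),
                  columns=['Feature1', 'Feature2', 'Feature3'])
df['Target'] = (df['Feature3'] >= 2).astype(int)

data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.33, random_state=42, stratify=labels)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Pick the probability column that belongs to class 1.
pos_col = list(clf.classes_).index(1)

df_new = df.copy()
# Same column name for both splits, so every row ends up with a value.
df_new.loc[X_test.index, 'pred'] = clf.predict_proba(X_test)[:, pos_col]
df_new.loc[X_train.index, 'pred'] = clf.predict_proba(X_train)[:, pos_col]

# Record which split each row came from; train-row probabilities are optimistic.
df_new['split'] = 'train'
df_new.loc[X_test.index, 'split'] = 'test'
print(df_new.head())
```

Keeping a split marker next to pred makes it harder to mistake probabilities for rows the model was fitted on for genuine out-of-sample predictions.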

Archived:	7 years, 10 months ago
Views:	8,075
Last activity:	7 years, 4 months ago