Predicting unseen data with a previously trained model

sha*_*arp 4 python machine-learning python-3.x scikit-learn

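I am using Scikit-learn for supervised machine learning. I have two datasets. The first contains data with both X features and Y labels. The second contains only X features, with no Y labels. I can successfully run LinearSVC on the train/test split of the first dataset and get Y labels for its test set.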

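Now I want to use the model trained on the first dataset to predict labels for the second dataset. How do I apply a model pre-trained on the first dataset to the second (unlabeled, unseen) dataset in Scikit-learn?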

The code snippet I tried (updated following the comments below):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import pickle


# ----------- Dataset 1: for training ----------- #
# Sample data ONLY
some_text = ['Books are amazing',
             'Harry potter book is awesome. It rocks',
             'Nutrition is very important',
             'Welcome to library, you can find as many book as you like',
             'Food like brocolli has many advantages']
y_variable = [1,1,0,1,0]

# books = 1 : y label
# food = 0 : y label

df = pd.DataFrame({'text':some_text,
                   'y_variable': y_variable
                          })

# ------------- TFIDF process -------------#
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df['text']).toarray()
labels = df.y_variable
features.shape


# ------------- Build Model -------------#
model = LinearSVC()
X_train, X_test, y_train, y_test= train_test_split(features,
                                                 labels,
                                                 train_size=0.5,
                                                 random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)


# Export model
pickle.dump(model, open('model.pkl', 'wb'))
# Read the Model
model_pre_trained = pickle.load(open('model.pkl','rb'))


# ----------- Dataset 2: UNSEEN DATASET ----------- #

some_text2 = ['Harry potter books are amazing',
             'Gluten free diet is getting popular']

unseen_df = pd.DataFrame({'text':some_text2}) # Notice this doesn't have y_variable. This is the data set whose y_variable labels (1 or 0) I am trying to predict.


# This is where the ERROR occurs
X_unseen = tfidf.fit_transform(unseen_df['text']).toarray()
y_pred_unseen = model_pre_trained.predict(X_unseen) # error here: 
# ValueError: X has 11 features per sample; expecting 26


print(X_unseen.shape) # prints (2, 11)
print(X_train.shape) # prints (2, 26)


# Looking for an output like this for UNSEEN data
# Looking for results after predicting unseen and no label data. 
# text                                   y_variable
# Harry potter books are amazing         1
# Gluten free diet is getting popular    0

It doesn't have to be the pickle code I tried above. I'm looking for suggestions, or for any built-in scikit-learn function that can do this kind of prediction.

Art*_*Sbr 6

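As you can see, your first tfidf transforms the input into 26 features, while the second tfidf transforms it into 11 features. The error occurs because the shape of X_train differs from that of X_unseen. The message tells you that each observation in X_unseen has fewer features than the number of features model was trained on.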
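To make the mismatch concrete, here is a minimal sketch with made-up sample sentences (not the asker's full data). It assumes the default TfidfVectorizer settings and compares refitting a new vectorizer on the unseen text with calling transform on the already fitted one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative sample text only (an assumption, not the original dataset)
train_text = ['Books are amazing', 'Nutrition is very important']
unseen_text = ['Harry potter books are amazing']

# Fit on the training text: the vocabulary (and so the column count) is learned here
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_text)

# Refitting a fresh vectorizer on the unseen text learns a different vocabulary,
# so the number of columns no longer matches what a model was trained on
X_refit = TfidfVectorizer().fit_transform(unseen_text)

# transform() reuses the vocabulary learned from the training text,
# so the column count matches X_train; words never seen in training are ignored
X_unseen = tfidf.transform(unseen_text)

print(X_train.shape, X_refit.shape, X_unseen.shape)
```

Only X_unseen has the same number of columns as X_train, which is what the model's predict expects.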

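After loading model in the second script, you are fitting another vectorizer to the text. That is, the tfidf in the first script and the tfidf in the second script are different objects. To make predictions with model, you need to transform the unseen text into X_unseen using the original tfidf. To do that, you must export the original vectorizer, load it in the new script, and use it to transform the new data before passing it to model.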

### Do this in the first program
# Dump model and tfidf
pickle.dump(model, open('model.pkl', 'wb'))
pickle.dump(tfidf, open('tfidf.pkl', 'wb'))

### Do this in the second program
model = pickle.load(open('model.pkl', 'rb'))
tfidf = pickle.load(open('tfidf.pkl', 'rb'))

# Use `transform` instead of `fit_transform`
X_unseen = tfidf.transform(unseen_df['text']).toarray()

# Predict on `X_unseen`
y_pred_unseen = model.predict(X_unseen)
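As a possible alternative to saving the model and the vectorizer separately, the two can be bundled into one scikit-learn Pipeline so that a single object is saved and loaded. This is only a sketch under assumptions: the name text_clf is hypothetical, the pipeline is fitted on all five sample sentences rather than on a train/test split, and the pickle round trip is done in memory here (pickle.dump and pickle.load with files work the same way).

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Sample data from the question
some_text = ['Books are amazing',
             'Harry potter book is awesome. It rocks',
             'Nutrition is very important',
             'Welcome to library, you can find as many book as you like',
             'Food like brocolli has many advantages']
y_variable = [1, 1, 0, 1, 0]

# The pipeline fits the vectorizer and the classifier together on raw text
text_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
text_clf.fit(some_text, y_variable)

# Serialize and restore the whole pipeline (in memory for this sketch)
payload = pickle.dumps(text_clf)
text_clf_loaded = pickle.loads(payload)

# Unseen data: pass raw text; the stored vectorizer calls transform internally
some_text2 = ['Harry potter books are amazing',
              'Gluten free diet is getting popular']
y_pred_unseen = text_clf_loaded.predict(some_text2)

for text, label in zip(some_text2, y_pred_unseen):
    print(text, label)
```

Because the fitted vectorizer travels inside the pipeline, there is no second fit_transform call to get wrong. With so little training data the predicted labels themselves should not be taken as meaningful.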