How to retrain a logistic regression model in sklearn with new data

Question

How to retrain a logistic regression model in sklearn with new data

Use*_*007 3 python machine-learning scikit-learn logistic-regression

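How do I retrain my existing machine learning model in sklearn (Python)?

I have thousands of records that I used to train my model, which I then saved as a .pkl file using pickle. When training the model for the first time, I passed warm_start=True when creating the logistic regression object.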

Sample code:

from sklearn import linear_model

log_regression_model = linear_model.LogisticRegression(warm_start=True)
log_regression_model.fit(X, Y)
# Saved this model as a .pkl file on the filesystem, like pickle.dump(model, open('model.pkl', 'wb'))

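I want to keep it in sync with the new data I receive every day. To do that, I open the existing model file, fetch the new data from the last 24 hours, and train again.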

Sample code:

import pickle

# Open the model from the filesystem
log_regression_model = pickle.load(open('model.pkl', 'rb'))
# The new X, Y here hold only the last 24 hours of data (a few hundred records).
log_regression_model.fit(X, Y)

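However, when I retrain the model after loading it from the filesystem, it seems to discard the existing model built from thousands of records and create a new one from the few hundred records of the last 24 hours (the model trained on thousands of records is 3 MB on the filesystem, while the retrained model is only 67 KB).

I have tried using the warm_start option. How can I retrain my LogisticRegression model?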

Answer 1

Jak*_*zuk 8

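When you call fit on an already trained model, you basically discard all of the previous information.

Scikit-learn has some models with a partial_fit method that can be used for incremental training, as shown in the documentation.

I don't remember whether logistic regression can be retrained in sklearn, but sklearn has SGDClassifier, which with loss='log' (spelled 'log_loss' in recent scikit-learn releases) runs logistic regression optimized by stochastic gradient descent, and it has a partial_fit method.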

Answer 2

Jer*_*bon 1

The size of a LogisticRegression object does not depend on the number of samples used to train it.

import pickle
import sys

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit on 100,000 samples and print the size of the pickled model.
np.random.seed(0)
X, y = np.random.randn(100000, 1), np.random.randint(2, size=(100000,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X, y)
print(sys.getsizeof(pickle.dumps(log_regression_model)))

# Fit on only 100 samples and print the size again.
np.random.seed(0)
X, y = np.random.randn(100, 1), np.random.randint(2, size=(100,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X, y)
print(sys.getsizeof(pickle.dumps(log_regression_model)))

This results in

1230
1233

You may have saved the wrong model object. Make sure you are saving log_regression_model.

pickle.dump(log_regression_model, open('model.pkl', 'wb'))

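Since the model sizes differ so much, and a LogisticRegression object does not change size with the number of training samples, it looks like different code was used to produce the saved model and this new "retrained" model.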
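One way to check what actually ended up in a pickle file is to load it back and inspect its type and coefficient shape. The following is only a sketch: the random data and the temporary file path are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = rng.integers(0, 2, size=500)
model = LogisticRegression().fit(X, y)

# Hypothetical path in a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# The class name and coefficient shape show what was stored in the file.
print(type(loaded).__name__)
print(loaded.coef_.shape)
print(os.path.getsize(path), "bytes")
```

If the loaded object turns out to be something other than a LogisticRegression (for example a pipeline or a container that also holds data), that would account for a much larger file.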

Nevertheless, it looks like warm_start isn't doing anything here either:

np.random.seed(0)
X, y = np.random.randn(200, 1), np.random.randint(2, size=(200,))

log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X[:100], y[:100])
print(log_regression_model.intercept_, log_regression_model.coef_)

log_regression_model.fit(X[100:], y[100:])
print(log_regression_model.intercept_, log_regression_model.coef_)

log_regression_model = LogisticRegression(warm_start=False)
log_regression_model.fit(X[100:], y[100:])
print(log_regression_model.intercept_, log_regression_model.coef_)

log_regression_model = LogisticRegression(warm_start=False)
log_regression_model.fit(X, y)
print(log_regression_model.intercept_, log_regression_model.coef_)

gives:

(array([ 0.01846266]), array([[-0.32172516]]))
(array([ 0.17253402]), array([[ 0.33734497]]))
(array([ 0.17253402]), array([[ 0.33734497]]))
(array([ 0.09707612]), array([[ 0.01501025]]))

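Based on this other question, warm_start will have some effect if you use another solver (e.g. LogisticRegression(warm_start=True, solver='sag')), but it is still not the same as retraining on the entire dataset with the new data added. For example, the four outputs above become: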

(array([ 0.01915884]), array([[-0.32176053]]))
(array([ 0.17973458]), array([[ 0.33708208]]))
(array([ 0.17968324]), array([[ 0.33707362]]))
(array([ 0.09903978]), array([[ 0.01488605]]))

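You can see that the middle two lines differ, but not by much. All warm_start does is use the parameters of the previous model as the starting point when training a new model on the new data. It sounds like what you want to do is keep the data and, every time new data is added, retrain on the old and new data combined.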
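A minimal sketch of that last suggestion, assuming the historical data is still available alongside the model. The arrays X_old, y_old, X_new and y_new are synthetic stand-ins invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the stored history and the last 24 hours.
X_old = rng.normal(size=(2000, 2))
y_old = (X_old[:, 0] > 0).astype(int)
X_new = rng.normal(size=(200, 2))
y_new = (X_new[:, 0] > 0).astype(int)

# Append the new records to the history, then fit on everything.
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])

model = LogisticRegression()
model.fit(X_all, y_all)
print(model.coef_, model.intercept_)
```

In practice this means persisting the training data (or at least a representative sample of it) in addition to the pickled model, since the model alone does not retain the records it was fitted on.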

Archived:	7 years, 11 months ago
Views:	8504
Last recorded:	7 years, 11 months ago