scikit-learn model persistence: pickle vs pmml vs ...?

sds*_*sds 9 python python-2.7 scikit-learn pmml

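I built a scikit-learn model that I want to reuse in a daily Python cron job (NB: no other platforms are involved: no R, no Java, etc.).

I pickled it (actually, I pickled my own object, one field of which is a GradientBoostingClassifier), and I unpickle it in the cron job. So far so good (and this has already been discussed in "Save classifier to disk in scikit-learn" and "Model persistence in Scikit-Learn?").

However, I upgraded sklearn, and now I get these warnings: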

.../.local/lib/python2.7/site-packages/sklearn/base.py:315: 
UserWarning: Trying to unpickle estimator DecisionTreeRegressor from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
.../.local/lib/python2.7/site-packages/sklearn/base.py:315: 
UserWarning: Trying to unpickle estimator PriorProbabilityEstimator from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
.../.local/lib/python2.7/site-packages/sklearn/base.py:315: 
UserWarning: Trying to unpickle estimator GradientBoostingClassifier from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)

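What do I do now?

  • I can downgrade to 0.18.1 and stick with it until I am ready to rebuild the model. For various reasons, I find this unacceptable.

  • I can unpickle the file and re-pickle it. This worked with 0.18.2 but breaks with 0.19. NFG. joblib looks no better.

  • I wish I could save the data in a version-independent ASCII format (e.g., JSON or XML). This is obviously the optimal solution, but there seems to be no way to do it (see also "Sklearn - model persistence without pkl file").

  • I could save the model to PMML, but its support is lukewarm at best: I can use sklearn2pmml to save the model (though not easily), and augustus / lightpmmlpredictor to apply (though not load) the model. However, none of those can be installed with pip directly, which makes deployment a nightmare. Also, the augustus and lightpmmlpredictor projects seem to be dead. "Importing PMML models into Python (Scikit-learn)": nope.

  • A variant of the above: save PMML using sklearn2pmml, and use openscoring for scoring. This requires interfacing with an external process. Yuk.

Suggestions?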

Dav*_*ale 5

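Model persistence across different versions of scikit-learn is generally impossible. The reason is obvious: you pickle Class1 with one definition and want to unpickle it into Class2 with another definition.

You can:

  • Still try to stick to one version of sklearn.
  • Ignore the warnings and hope that what worked for Class1 will also work for Class2.
  • Write your own class that can serialize your GradientBoostingClassifier and restore it from this serialized form, and hope that it works better than pickle.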
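If you go with the first option, it may help to make the pinned version explicit, so that the cron job fails loudly instead of printing warnings. Here is a minimal sketch under my own assumptions: the helper names dumps_with_version and loads_checked are hypothetical, and the version string would come from sklearn.__version__ in real use. The model is pickled into nested bytes so that the version can be checked before the estimator itself is unpickled.

```python
import pickle


def dumps_with_version(model, library_version):
    """Pickle `model` together with the library version it was built with."""
    payload = {
        "library_version": library_version,
        # Nested bytes: the model is only unpickled after the version check.
        "model_bytes": pickle.dumps(model, protocol=2),
    }
    return pickle.dumps(payload, protocol=2)


def loads_checked(blob, running_version):
    """Unpickle the model, refusing if it was saved under another version."""
    payload = pickle.loads(blob)
    saved = payload["library_version"]
    if saved != running_version:
        raise RuntimeError(
            "model was saved with version {} but {} is installed".format(
                saved, running_version))
    return pickle.loads(payload["model_bytes"])


# A plain dict stands in for the fitted estimator in this illustration.
blob = dumps_with_version({"weights": [0.1, 0.2]}, "0.18.1")
restored = loads_checked(blob, "0.18.1")
```

This does not make the pickle portable; it only turns a silent mismatch into an explicit error that tells you when the model has to be rebuilt.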

I made an example of how you can convert a single DecisionTreeRegressor into a pure list-and-dict format, fully JSON-compatible, and restore it back.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_classification

### Code to serialize and deserialize trees

LEAF_ATTRIBUTES = ['children_left', 'children_right', 'threshold', 'value', 'feature', 'impurity', 'weighted_n_node_samples']
TREE_ATTRIBUTES = ['n_classes_', 'n_features_', 'n_outputs_']

def serialize_tree(tree):
    """ Convert a sklearn.tree.DecisionTreeRegressor into a json-compatible format """
    encoded = {
        'nodes': {},
        'tree': {},
        'n_leaves': len(tree.tree_.threshold),
        'params': tree.get_params()
    }
    for attr in LEAF_ATTRIBUTES:
        encoded['nodes'][attr] = getattr(tree.tree_, attr).tolist()
    for attr in TREE_ATTRIBUTES:
        encoded['tree'][attr] = getattr(tree, attr)
    return encoded

def deserialize_tree(encoded):
    """ Restore a sklearn.tree.DecisionTreeRegressor from a json-compatible format """
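    # Note: 'n_leaves' actually holds the number of nodes (leaves and splits).
    # Fit a throwaway tree on that many distinct targets so that its node arrays
    # are long enough, then overwrite the entries in place below.
    x = np.arange(encoded['n_leaves'])
    tree = DecisionTreeRegressor().fit(x.reshape((-1,1)), x)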
    tree.set_params(**encoded['params'])
    for attr in LEAF_ATTRIBUTES:
        for i in range(encoded['n_leaves']):
            getattr(tree.tree_, attr)[i] = encoded['nodes'][attr][i]
    for attr in TREE_ATTRIBUTES:
        setattr(tree, attr, encoded['tree'][attr])
    return tree

## test the code

X, y = make_classification(n_classes=3, n_informative=10)
tree = DecisionTreeRegressor().fit(X, y)
encoded = serialize_tree(tree)
decoded = deserialize_tree(encoded)
assert (decoded.predict(X)==tree.predict(X)).all()

With this, you can go on to serialize and deserialize the whole GradientBoostingClassifier:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble.gradient_boosting import PriorProbabilityEstimator

def serialize_gbc(clf):
    encoded = {
        'classes_': clf.classes_.tolist(),
        'max_features_': clf.max_features_, 
        'n_classes_': clf.n_classes_,
        'n_features_': clf.n_features_,
        'train_score_': clf.train_score_.tolist(),
        'params': clf.get_params(),
        'estimators_shape': list(clf.estimators_.shape),
        'estimators': [],
        'priors':clf.init_.priors.tolist()
    }
    for tree in clf.estimators_.reshape((-1,)):
        encoded['estimators'].append(serialize_tree(tree))
    return encoded

def deserialize_gbc(encoded):
    x = np.array(encoded['classes_'])
    clf = GradientBoostingClassifier(**encoded['params']).fit(x.reshape(-1, 1), x)
    trees = [deserialize_tree(tree) for tree in encoded['estimators']]
    clf.estimators_ = np.array(trees).reshape(encoded['estimators_shape'])
    clf.init_ = PriorProbabilityEstimator()
    clf.init_.priors = np.array(encoded['priors'])
    clf.classes_ = np.array(encoded['classes_'])
    clf.train_score_ = np.array(encoded['train_score_'])
    clf.max_features_ = encoded['max_features_']
    clf.n_classes_ = encoded['n_classes_']
    clf.n_features_ = encoded['n_features_']
    return clf

# test on the same problem
clf = GradientBoostingClassifier()
clf.fit(X, y);
encoded = serialize_gbc(clf)
decoded = deserialize_gbc(encoded)
assert (decoded.predict(X) == clf.predict(X)).all()

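This works for scikit-learn v0.19, but don't ask me what will come in the next versions to break this code. I am neither a prophet nor a developer of sklearn.

If you want to be completely independent of new versions of sklearn, the safest thing is to write a function that traverses a serialized tree and makes a prediction, instead of re-creating a sklearn tree.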
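A rough sketch of such a traversal, in pure Python, could look like the following. It assumes the dictionary layout produced by serialize_tree and serialize_gbc above, that a leaf is marked by a child index of -1, that a sample goes left when its feature value is <= the threshold, and that the classifier is a multiclass one whose raw score is the prior plus learning_rate times the sum of the tree outputs. The function names predict_tree and predict_gbc and the hand-built stump helper are my own, for illustration only.

```python
TREE_LEAF = -1  # child index that marks a leaf in the serialized arrays


def predict_tree(encoded, row):
    """Walk one serialized regression tree and return its value for `row`."""
    nodes = encoded["nodes"]
    node = 0  # the root is stored at index 0
    while nodes["children_left"][node] != TREE_LEAF:
        if row[nodes["feature"][node]] <= nodes["threshold"][node]:
            node = nodes["children_left"][node]
        else:
            node = nodes["children_right"][node]
    # 'value' is nested as [node][output][0] for a single-output regressor
    return nodes["value"][node][0][0]


def predict_gbc(encoded, row):
    """Score `row` with every tree and return the label with the top score."""
    learning_rate = encoded["params"]["learning_rate"]
    n_stages, n_columns = encoded["estimators_shape"]
    scores = list(encoded["priors"])  # start from the stored priors
    for stage in range(n_stages):
        for k in range(n_columns):
            # estimators were flattened row by row: one row per stage
            tree = encoded["estimators"][stage * n_columns + k]
            scores[k] += learning_rate * predict_tree(tree, row)
    best = max(range(n_columns), key=lambda k: scores[k])
    return encoded["classes_"][best]


def stump(threshold, left_value, right_value):
    """Hand-built one-split tree on feature 0, in the serialized layout."""
    return {"nodes": {
        "children_left": [1, TREE_LEAF, TREE_LEAF],
        "children_right": [2, TREE_LEAF, TREE_LEAF],
        "feature": [0, -2, -2],
        "threshold": [threshold, -2.0, -2.0],
        "value": [[[0.0]], [[left_value]], [[right_value]]],
    }}


# Toy model with three classes and a single boosting stage
toy = {
    "classes_": ["a", "b", "c"],
    "priors": [0.3, 0.3, 0.4],
    "params": {"learning_rate": 0.1},
    "estimators_shape": [1, 3],
    "estimators": [stump(0.5, 2.0, -1.0),
                   stump(0.5, -1.0, 2.0),
                   stump(0.5, -1.0, -1.0)],
}
label_low = predict_gbc(toy, [0.2])
label_high = predict_gbc(toy, [0.9])
```

As far as I know, sklearn converts inputs to 32-bit floats before comparing them with thresholds, so a pure Python walk like this could disagree with sklearn for feature values that sit extremely close to a split threshold. It is worth comparing both on your own data before relying on it.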