Saving a sklearn.svm.SVR model as JSON instead of pickling

Question

P0i*_*MaN 5 machine-learning svm scikit-learn

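I have a trained SVR model that needs to be saved in JSON format rather than pickled.

The idea behind JSON-ifying the trained model is simply to capture the state of the weights and the other "fitted" attributes. I can then set these attributes later to make predictions. Here is an implementation of what I did: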

    import json

    import numpy as np
    from sklearn.svm import SVR

    # assume x_train and y_train are already defined
    regressor = SVR()
    regressor.fit(x_train, y_train)

    # save the regressor params in a JSON file for later retrieval
    with open('saved_regressor_params.json', 'w', encoding='utf-8') as outfile:
        json.dump(regressor.get_params(), outfile)

    # find the fitted attributes of SVR()
    # if an attribute is trailed by '_', it's a fitted attribute
    attrs = [i for i in dir(regressor) if i.endswith('_') and not i.endswith('__')]
    remove_list = ['coef_', '_repr_html_', '_repr_mimebundle_']  # unnecessary attributes
    attrs = [i for i in attrs if i not in remove_list]

    # collect the trained attribute values, converting NumPy arrays to lists
    attr_dict = {i: getattr(regressor, i) for i in attrs}
    for k in attr_dict:
        if isinstance(attr_dict[k], np.ndarray):
            attr_dict[k] = attr_dict[k].tolist()

    # dump JSON for prediction
    with open('saved_regressor.json', 'w', encoding='utf-8') as outfile:
        json.dump(attr_dict, outfile, separators=(',', ':'), sort_keys=True, indent=4)

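This creates two separate JSON files. One, called saved_regressor_params.json, stores the parameters the SVR requires; the other, called saved_regressor.json, stores the attributes and their trained values as an object. Example (saved_regressor.json):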

{
    "_dual_coef_":[
        [
            -1.0,
            -1.0,
            -1.0
        ]
    ],
    "_intercept_":[
        1.323423423
    ],
         ...
         ...

    "_n_support_":[
        3
    ]
}

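Later, I can create a new SVR() model and simply set these parameters and attributes on it by loading them from the JSON files we just created. Then I call the predict() method to make predictions. Like this (in a new file):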

import json

import numpy as np
from sklearn.svm import SVR

predict_svr = SVR()

# load the JSON from the files
with open('saved_regressor_params.json', 'r', encoding='utf-8') as infile:
    params = json.load(infile)

with open('saved_regressor.json', 'r', encoding='utf-8') as infile:
    attributes = json.load(infile)

# set the params
predict_svr.set_params(**params)

# set the attributes, turning lists back into NumPy arrays
for k in attributes:
    if isinstance(attributes[k], list):
        setattr(predict_svr, k, np.array(attributes[k]))
    else:
        setattr(predict_svr, k, attributes[k])

predict_svr.predict(...)

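However, during this process, one specific attribute, named n_support_, cannot be set for some reason. Even if I ignore the n_support_ attribute, it produces additional errors. (Is my logic wrong, or am I missing something here?)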
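
One possible explanation, offered as a sketch rather than a confirmed diagnosis: in recent scikit-learn releases some trailing-underscore names on SVR appear to be read-only properties rather than plain attributes, so setattr() on them raises an AttributeError. The helper name read_only_properties below is made up for this illustration, and the result may differ between scikit-learn versions.

```python
import inspect

from sklearn.svm import SVR


def read_only_properties(cls):
    """Return the names on cls that are properties without a setter."""
    names = []
    for name in dir(cls):
        # getattr_static avoids triggering the property getter itself
        member = inspect.getattr_static(cls, name, None)
        if isinstance(member, property) and member.fset is None:
            names.append(name)
    return names


# keep only public names that look like fitted attributes
fitted_style = [
    n for n in read_only_properties(SVR)
    if n.endswith("_") and not n.startswith("_")
]
print(fitted_style)
```

Any name printed here cannot be assigned with setattr(); the underlying value would have to be restored through whatever private attribute backs the property, which is an implementation detail.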

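So I am looking for a different or cleverer way to save an SVR model as JSON.

I have already tried existing third-party helper libraries such as sklearn_json. These libraries tend to export linear models perfectly, but not support vector models.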

Answer 1

Bob*_*Bob 1

Following the documentation (version 1.1.2), here is the reproducible example that is missing from the OP:

from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
regressor = SVR(C=1.0, epsilon=0.2)
regressor.fit(X, y)

Then a sketch of the JSON serialization/deserialization:

import json
# serialize
serialized = json.dumps({
    k: v.tolist() if isinstance(v, np.ndarray) else v 
    for k, v in regressor.__dict__.items()
})

# deserialize
regressor2 = SVR()
regressor2.__dict__ = {
     k: np.asarray(v) if isinstance(v, list) else v 
     for k, v in json.loads(serialized).items()
}

# test
assert np.all(regressor.predict(X) == regressor2.predict(X))
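
One thing the sketch above does not carry through JSON is each array's dtype. A small illustration, using a made-up int32 array rather than the model's own attributes (the integer type NumPy picks on reload depends on the platform and NumPy version, so no specific output is claimed):

```python
import json

import numpy as np

# hypothetical array standing in for an integer-typed fitted attribute
original = np.array([3, 1, 2], dtype=np.int32)

# round trip through JSON without recording the dtype
restored = np.asarray(json.loads(json.dumps(original.tolist())))
print(original.dtype, "->", restored.dtype)  # NumPy chooses its default integer type

# passing the dtype explicitly brings the original type back
restored_typed = np.asarray(
    json.loads(json.dumps(original.tolist())), dtype=original.dtype
)
print(restored_typed.dtype)
```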

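Edit: serialization that preserves data types

A not-so-elegant solution to the first problem mentioned in the comments is to save the data type along with the data.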

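import json

# serialize: store each array as [data, 'np.ndarray', dtype]
serialized = json.dumps({
    k: [v.tolist(), 'np.ndarray', str(v.dtype)] if isinstance(v, np.ndarray) else v
    for k, v in regressor.__dict__.items()
})

# deserialize: rebuild tagged arrays with their original dtype
def is_tagged_array(v):
    # the length check avoids an IndexError on short plain lists
    return isinstance(v, list) and len(v) == 3 and v[1] == 'np.ndarray'

regressor2 = SVR()
regressor2.__dict__ = {
    k: np.asarray(v[0], dtype=v[2]) if is_tagged_array(v) else v
    for k, v in json.loads(serialized).items()
}

# test
assert np.all(regressor.predict(X) == regressor2.predict(X))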
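
The same idea can be wrapped in a pair of helpers. This is only a sketch under a few assumptions: the names encode_value, decode_value, dump_estimator and load_estimator are made up here and are not part of scikit-learn, arrays are tagged with a dict instead of a list, and the estimator is restored by writing to its __dict__, which relies on implementation details and should only be trusted with the same scikit-learn version that produced the JSON.

```python
import json

import numpy as np
from sklearn.svm import SVR


def encode_value(value):
    """Convert one attribute value into something json.dumps accepts."""
    if isinstance(value, np.ndarray):
        # tag arrays and keep dtype and shape so they can be rebuilt
        return {"__ndarray__": value.tolist(),
                "dtype": str(value.dtype),
                "shape": list(value.shape)}
    if isinstance(value, np.generic):
        return value.item()  # NumPy scalar -> plain Python scalar
    if isinstance(value, tuple):
        return {"__tuple__": [encode_value(v) for v in value]}
    return value


def decode_value(value):
    """Inverse of encode_value."""
    if isinstance(value, dict) and "__ndarray__" in value:
        array = np.asarray(value["__ndarray__"], dtype=value["dtype"])
        return array.reshape(tuple(value["shape"]))
    if isinstance(value, dict) and "__tuple__" in value:
        return tuple(decode_value(v) for v in value["__tuple__"])
    return value


def dump_estimator(estimator):
    """Serialize an estimator's instance attributes to a JSON string."""
    return json.dumps({k: encode_value(v) for k, v in vars(estimator).items()})


def load_estimator(text, estimator_class=SVR):
    """Rebuild an estimator from the output of dump_estimator."""
    estimator = estimator_class()
    estimator.__dict__.update(
        {k: decode_value(v) for k, v in json.loads(text).items()}
    )
    return estimator


# usage on synthetic data (hypothetical, mirrors the example above)
rng = np.random.RandomState(0)
X_demo = rng.randn(20, 4)
y_demo = rng.randn(20)
model = SVR(C=1.0, epsilon=0.2).fit(X_demo, y_demo)
restored = load_estimator(dump_estimator(model))
print(np.allclose(model.predict(X_demo), restored.predict(X_demo)))
```

Writing the string returned by dump_estimator to a file and reading it back gives the same result as passing it around in memory.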

Archived:	3 years, 6 months ago
Views:	748
Last recorded:	3 years, 6 months ago