How to retrieve the original variables after running a scikit-learn model with OneHotEncoding

Question

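I have successfully run a logistic regression model with scikit-learn's SGDClassifier, but I can't easily interpret the model's coefficients (accessed via SGDClassifier.coef_) because the input data was transformed with scikit-learn's OneHotEncoder.

My original input data X has shape (12000, 11):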

X = np.array([[1,4,3...9,4,1],
              [5,9,2...3,1,4],
              ...
              [7,8,1...6,7,8]
              ])

Then I applied one-hot encoding:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
X_OHE = enc.fit_transform(X).toarray()
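
To make the change in width concrete, here is a minimal sketch on made-up toy data (the names X_toy, enc_toy and X_toy_ohe are hypothetical, not from the question). It assumes scikit-learn 0.20 or later, where the fitted encoder exposes a categories_ attribute. Each original column expands into one output column per distinct value it contains.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 4 rows, 2 categorical columns stored as integers.
X_toy = np.array([[1, 7],
                  [2, 7],
                  [3, 9],
                  [1, 9]])

enc_toy = OneHotEncoder()
X_toy_ohe = enc_toy.fit_transform(X_toy).toarray()

# Column 0 has 3 distinct values and column 1 has 2, so 2 columns become 5.
print(X_toy.shape, "->", X_toy_ohe.shape)

# One array of observed values per original column, in output-column order.
print(enc_toy.categories_)
```

By the same logic, the 696 encoded columns in the question would correspond to the total number of distinct values across the 11 original columns.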

This produces an array of shape (12000, 696):

X_OHE = np.array([[1,0,1...0,0,1],
                 [0,0,0...0,1,0],
                  ...
                 [1,0,1...0,0,1]
                 ])

Then I access the model's coefficients via SGDClassifier.coef_, which yields an array of shape (1, 696):

coefs = np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])

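How can I map the coefficient values back to the original values in X, so that I can say something like "if variable foo has the value bar, the target variable increases/decreases by bar_coeff"?

Let me know if you need more information about the data or the model parameters. Thanks.

I found an unanswered question on SO: How to retrieve coefficient names after label encoding and one-hot encoding on scikit-learn?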

Answer 1

Nic*_*gel 1

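After reading this user's detailed explanation of OneHotEncoder, I was able to put together a (somewhat hack-y) way to relate the model coefficients back to the original dataset.

Assuming you have set up your OneHotEncoder correctly: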

from sklearn.preprocessing import OneHotEncoder
from scipy import sparse

enc = OneHotEncoder()
X_OHE = enc.fit_transform(X)   # X and X_OHE as described in question

and you have successfully fitted a GLM, for example:

from sklearn import linear_model

clf = linear_model.SGDClassifier()
clf.fit(X_train, y_train)

which has the coefficients clf.coef_:

print(clf.coef_)
# np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])

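You can trace the encoded 1s and 0s in X_OHE back to the original values in X with the approach below. I recommend reading the detailed explanation of OneHotEncoding mentioned above (linked at the top), otherwise what follows will look like gibberish. In short, the code iterates over each feature in X_OHE and uses enc's internal feature_indices_ attribute to do the translation.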
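
Before the full loop, the offset arithmetic can be illustrated on its own. The sketch below uses made-up offsets in place of enc.feature_indices_. The decode helper and the numbers are hypothetical, and np.searchsorted stands in for the np.extract step used in the loop. Given an index f in the full one-hot space, it finds which original column f belongs to and which value within that column it represents.

```python
import numpy as np

# Hypothetical offsets for 3 original columns with 4, 3 and 5 possible values.
# They play the role that enc.feature_indices_ plays in the answer's loop.
feature_indices = np.array([0, 4, 7, 12])


def decode(f):
    """Map an index f in the one-hot space to (original column, original value)."""
    # The last offset that is <= f marks where f's original column starts.
    col = np.searchsorted(feature_indices, f, side="right") - 1
    return int(col), int(f - feature_indices[col])


print(decode(0))   # first value of column 0
print(decode(5))   # second value of column 1
print(decode(11))  # last value of column 2
```

The value returned here is the integer label that was fed to the encoder. If those integers came from a label encoder, they still need to be mapped back to the original strings.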

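import pandas as pd
import numpy as np

# Note: enc.active_features_ and enc.feature_indices_ only exist in
# scikit-learn versions before 0.22; they were removed after that.
results = []

for i in range(enc.active_features_.shape[0]):
    # Position of the i-th encoded column in the full one-hot space
    f = enc.active_features_[i]

    # All column start offsets at or before f; the last one belongs to f's column
    index_range = np.extract(enc.feature_indices_ <= f, enc.feature_indices_)
    s = len(index_range) - 1       # index of the original column in X (not used below)
    f_index = index_range[-1]      # start offset of that original column
    f_label_decoded = f - f_index  # original (label-encoded) value

    results.append({
        'label_decoded_value': f_label_decoded,
        'coefficient': clf.coef_[0][i]
    })

R = pd.DataFrame.from_records(results)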

where R looks like this (I had originally encoded the names of company departments):

coefficient label_decoded_value
3.929413    DepartmentFoo1
3.718078    DepartmentFoo2
3.101869    DepartmentFoo3
2.892845    DepartmentFoo4
...

So now you can say: "When an employee is in department 'Foo1', the target variable increases by 3.929413" (for a logistic model, that increase is in the log-odds of the target).
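
On newer scikit-learn releases the attributes used in the loop above are gone, so a similar mapping could be sketched from enc.categories_ instead. This is a minimal sketch, not the answer's original method. It assumes scikit-learn 0.20 or later and an encoder fitted with default settings, and every name ending in _demo is hypothetical toy data standing in for the question's X, y, enc and clf.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for X and y from the question.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 4, size=(200, 3))
y_demo = (X_demo[:, 0] == 2).astype(int)

enc_demo = OneHotEncoder()
X_demo_ohe = enc_demo.fit_transform(X_demo)

clf_demo = SGDClassifier(random_state=0)
clf_demo.fit(X_demo_ohe, y_demo)

# categories_[j] lists the values of original column j, in the same order
# as the encoded columns, so flattening it lines up with coef_[0].
rows = [
    {"original_column": j, "original_value": value}
    for j, values in enumerate(enc_demo.categories_)
    for value in values
]
R_demo = pd.DataFrame(rows)
R_demo["coefficient"] = clf_demo.coef_[0]

print(R_demo.sort_values("coefficient", ascending=False).head())
```

Each row of R_demo pairs one coefficient with the original column index and the original value it encodes, which is the "variable foo has the value bar" mapping the question asks for.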

Archived:	8 years, 10 months ago
Views:	520
Last activity:	8 years, 10 months ago