Chr*_*her 5 nlp regression scikit-learn
I am trying out eli5 to understand how individual terms contribute to the prediction of certain classes.
You can run this script:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
#categories = ['alt.atheism', 'soc.religion.christian']
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics']
np.random.seed(1)
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=7)
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=7)
bow = CountVectorizer(stop_words='english')
clf = LogisticRegression()
pipel = Pipeline([('bow', bow),
('classifier', clf)])
pipel.fit(train.data, train.target)
import eli5
eli5.show_weights(clf, vec=bow, top=20)
Problem:
With two labels, the output is unfortunately limited to a single table:
categories = ['alt.atheism', 'soc.religion.christian']
With three labels, however, it outputs three tables:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics']
Is the missing y=0 table in the first output a bug in the software, or am I missing a statistical point? In the first case I would expect to see two tables.
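This has nothing to do with eli5. It comes from how scikit-learn (here LogisticRegression()) handles two classes. With only two classes the problem becomes a binary one, so the fitted classifier returns only a single row of coefficients.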
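To see why a single row is enough in the binary case, here is a minimal sketch in plain Python. The weights, bias, and term counts are made-up illustrative values, not taken from the fitted model above. In binary logistic regression the probability of class 1 is sigmoid(z) and the probability of class 0 is sigmoid(-z), so a table for y=0 would just be the y=1 table with every sign flipped.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for a 3-term vocabulary (illustrative values only).
weights = [1.5, -0.7, 0.2]
bias = -0.1

# Hypothetical term counts for one document.
x = [2, 1, 0]

# One linear score is all a binary model needs.
z = sum(w * xi for w, xi in zip(weights, x)) + bias

p_class1 = sigmoid(z)    # probability of the positive class
p_class0 = sigmoid(-z)   # same score with the sign flipped

print(p_class1, p_class0, p_class1 + p_class0)
```

The two probabilities always sum to 1, which is why the model stores one weight vector rather than two.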
Look at the attributes of LogisticRegression:
coef_ : array, shape (1, n_features) or (n_classes, n_features)
    Coefficient of the features in the decision function. coef_ is of shape (1, n_features) when the given problem is binary.
intercept_ : array, shape (1,) or (n_classes,)
    Intercept (a.k.a. bias) added to the decision function. If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the problem is binary.
coef_ has shape (1, n_features) in the binary case, and this coef_ is what eli5.show_weights() uses.
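You can check the shapes yourself without downloading 20 newsgroups. The following is a small sketch on a made-up toy corpus (the documents and labels are invented for illustration). It fits the same kind of model once with two classes and once with three, then prints the shape of coef_.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, invented for illustration only.
docs = [
    "god faith church",
    "atheism reason evidence",
    "church prayer faith",
    "evidence atheism argument",
    "image pixel render",
    "render graphics image",
]
labels = np.array([1, 0, 1, 0, 2, 2])

vec = CountVectorizer()
X = vec.fit_transform(docs)
n_features = X.shape[1]

# Two classes: only the first four documents (labels 0 and 1).
clf2 = LogisticRegression().fit(X[:4], labels[:4])

# Three classes: all six documents.
clf3 = LogisticRegression().fit(X, labels)

print("binary coef_ shape:    ", clf2.coef_.shape)   # (1, n_features)
print("multiclass coef_ shape:", clf3.coef_.shape)   # (3, n_features)
print("binary classes_:       ", clf2.classes_)
```

In the binary model the single row describes the second entry of classes_: positive coefficients push toward that class and negative ones toward the other.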
Hope this makes it clear.