使用scikit学习获取最具信息性的功能的问题?

joh*_*doe 17 python nlp machine-learning pandas scikit-learn

我试图从文本语料库中获取最丰富的功能.从这个回答良好的问题我知道这项任务可以按如下方式完成:

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print classlabel, feat, coef
Run Code Online (Sandbox Code Playgroud)

然后:

most_informative_feature_for_class(tfidf_vect, clf, 5)
Run Code Online (Sandbox Code Playgroud)

对于这个classfier:

X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values


from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
                                                    y, test_size=0.33)
clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)
Run Code Online (Sandbox Code Playgroud)

问题是输出most_informative_feature_for_class:

5 a_base_de_bien bastante   (0, 2451)   -0.210683496368
  (0, 3533) -0.173621065386
  (0, 8034) -0.135543062425
  (0, 10346)    -0.173621065386
  (0, 15231)    -0.154148294738
  (0, 18261)    -0.158890483047
  (0, 21083)    -0.297476572586
  (0, 434)  -0.0596263855375
  (0, 446)  -0.0753492277856
  (0, 769)  -0.0753492277856
  (0, 1118) -0.0753492277856
  (0, 1439) -0.0753492277856
  (0, 1605) -0.0753492277856
  (0, 1755) -0.0637950312345
  (0, 3504) -0.0753492277856
  (0, 3511) -0.115802483001
  (0, 4382) -0.0668983049212
  (0, 5247) -0.315713152154
  (0, 5396) -0.0753492277856
  (0, 5753) -0.0716096348446
  (0, 6507) -0.130661516772
  (0, 7978) -0.0753492277856
  (0, 8296) -0.144739048504
  (0, 8740) -0.0753492277856
  (0, 8906) -0.0753492277856
  : :
  (0, 23282)    0.418623443832
  (0, 4100) 0.385906085143
  (0, 15735)    0.207958503155
  (0, 16620)    0.385906085143
  (0, 19974)    0.0936828782325
  (0, 20304)    0.385906085143
  (0, 21721)    0.385906085143
  (0, 22308)    0.301270427482
  (0, 14903)    0.314164150621
  (0, 16904)    0.0653764031957
  (0, 20805)    0.0597723455204
  (0, 21878)    0.403750815828
  (0, 22582)    0.0226150073272
  (0, 6532) 0.525138162099
  (0, 6670) 0.525138162099
  (0, 10341)    0.525138162099
  (0, 13627)    0.278332617058
  (0, 1600) 0.326774799211
  (0, 2074) 0.310556919237
  (0, 5262) 0.176400451433
  (0, 6373) 0.290124806858
  (0, 8593) 0.290124806858
  (0, 12002)    0.282832270298
  (0, 15008)    0.290124806858
  (0, 19207)    0.326774799211
Run Code Online (Sandbox Code Playgroud)

它没有返回标签也没有返回单词.为什么会发生这种情况,如何打印文字和标签?你们这是发生了吗,因为我正在使用熊猫来读取数据?我尝试的另一件事是以下,形成这个问题:

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))


print_top10(tfidf_vect,clf,y)
Run Code Online (Sandbox Code Playgroud)

但我得到了这个追溯:

Traceback(最近一次调用最后一次):

  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module>
    print_top10(tfidf_vect,clf,5)
  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10
    for i, class_label in enumerate(class_labels):
TypeError: 'int' object is not iterable
Run Code Online (Sandbox Code Playgroud)

任何想法如何解决这个问题,以获得具有最高系数值的功能?

cha*_*ers 13

为了特别针对线性SVM解决这个问题,我们首先要了解sklearn中SVM的表达方式以及它对MultinomialNB的不同之处.

为什么most_informative_feature_for_classMultinomialNB 的工作原因是因为它的输出coef_基本上是给定一个类的特征的对数概率(因此[nclass, n_features]由于朴素贝叶斯问题的形成而具有大小.但是如果我们检查SVM 的文档,该coef_不是那么简单,而是coef_对(线性)SVM是[n_classes * (n_classes -1)/2, n_features]因为每个二进制模式被安装到每一个可能的类.

如果我们确实掌握了一些我们感兴趣的特定系数的知识,我们可以将函数改为如下所示:

def most_informative_feature_for_class_svm(vectorizer, classifier,  classlabel, n=10):
    labelid = ?? # this is the coef we're interested in. 
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray() 
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print feat, coef
Run Code Online (Sandbox Code Playgroud)

这将按预期工作,并根据您所追求的系数向量打印出标签和前n个特征.

至于为特定类获取正确的输出,这取决于假设和您想要输出的内容.我建议阅读SVM文档中的多类文档,以了解您所追求的内容.

因此,使用此问题中描述的train.txt 文件,我们可以获得某种输出,但在这种情况下,它不是特别具有描述性或有助于解释.希望这会对你有所帮助.

import codecs, re, time
from itertools import chain

import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'

# Vectorizing data.
train = []
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
tags = ['bs','pt','es','sr']

# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)

from sklearn.svm import SVC
svcc = SVC(kernel='linear', C=1)
svcc.fit(trainset, tags)

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print classlabel, feat, coef

def most_informative_feature_for_class_svm(vectorizer, classifier,  n=10):
    labelid = 3 # this is the coef we're interested in. 
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray() 
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print feat, coef

most_informative_feature_for_class(word_vectorizer, mnb, 'pt')
print 
most_informative_feature_for_class_svm(word_vectorizer, svcc)
Run Code Online (Sandbox Code Playgroud)

输出:

pt teve -4.63472898823
pt tive -4.63472898823
pt todas -4.63472898823
pt vida -4.63472898823
pt de -4.22926388012
pt foi -4.22926388012
pt mais -4.22926388012
pt me -4.22926388012
pt as -3.94158180767
pt que -3.94158180767

no 0.0204081632653
parecer 0.0204081632653
pone 0.0204081632653
por 0.0204081632653
relación 0.0204081632653
una 0.0204081632653
visto 0.0204081632653
ya 0.0204081632653
es 0.0408163265306
lo 0.0408163265306
Run Code Online (Sandbox Code Playgroud)