将NLTK NaiveBayesClassifier中最具信息性的功能存储在列表中

Rom*_*mes 7 python classification nltk scikit-learn naivebayes

我在python中尝试这个Naive Bayes分类器:

classifier = nltk.NaiveBayesClassifier.train(train_set)
print "Naive Bayes Accuracy " + str(nltk.classify.accuracy(classifier, test_set)*100)
classifier.show_most_informative_features(5)
Run Code Online (Sandbox Code Playgroud)

我有以下输出:

控制台输出

可以清楚地看到哪些单词更多地出现在"重要"中,哪些出现在"垃圾邮件"类别中.但我无法使用这些值.我实际上想要一个如下所示的列表:

[[pass,important],[respective,spam],[investment,spam],[internet,spam],[understands,spam]]
Run Code Online (Sandbox Code Playgroud)

我是python的新手并且很难搞清楚所有这些,有人可以帮忙吗?我会非常感激的.

Nic*_*eli 2

您可以稍微修改源代码show_most_informative_features以适合您的目的。

子列表的第一个元素对应于信息最丰富的特征名称,而第二个元素对应于其标签(更具体地说,与比率的分子项相关联的标签)。

辅助函数:

def show_most_informative_features_in_list(classifier, n=10):
    """
    Return a nested list of the "most informative" features 
    used by the classifier along with it's predominant labels
    """
    cpdist = classifier._feature_probdist       # probability distribution for feature values given labels
    feature_list = []
    for (fname, fval) in classifier.most_informative_features(n):
        def labelprob(l):
            return cpdist[l, fname].prob(fval)
        labels = sorted([l for l in classifier._labels if fval in cpdist[l, fname].samples()], 
                        key=labelprob)
        feature_list.append([fname, labels[-1]])
    return feature_list
Run Code Online (Sandbox Code Playgroud)

在经过正面/负面电影评论语料库训练的分类器上进行测试nltk

show_most_informative_features_in_list(classifier, 10)
Run Code Online (Sandbox Code Playgroud)

产生:

[['outstanding', 'pos'],
 ['ludicrous', 'neg'],
 ['avoids', 'pos'],
 ['astounding', 'pos'],
 ['idiotic', 'neg'],
 ['atrocious', 'neg'],
 ['offbeat', 'pos'],
 ['fascination', 'pos'],
 ['symbol', 'pos'],
 ['animators', 'pos']]
Run Code Online (Sandbox Code Playgroud)