Posts by Ali*_*ice

SkLearn Multinomial NB: Most Informative Features

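Since my classifier reaches about 99% accuracy on the test data, I am a bit suspicious and want to look at the most informative features of my NB classifier to see what kind of features it is learning. The following thread was very useful: How to get most informative features for scikit-learn classifiers?

As for my feature input, I am still experimenting. At the moment I am testing a simple unigram model using CountVectorizer: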

 vectorizer = CountVectorizer(ngram_range=(1, 1), min_df=2, stop_words='english')


In the thread mentioned above, I found the following function:

def show_most_informative_features(vectorizer, clf, n=20):
    # Pair each coefficient with its feature name and sort ascending
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Lowest n coefficients alongside the highest n
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))


This gives the following result:

    -16.2420        114th                   -4.0020 said           
    -16.2420        115                     -4.6937 obama          
    -16.2420        136                     -4.8614 house          
    -16.2420        14th                    -5.0194 president      
    -16.2420        15th                    -5.1236 state          
    -16.2420        1600                    -5.1370 senate         
    -16.2420        16th                    -5.3868 new            
    -16.2420        1920                    -5.4004 republicans    
    -16.2420        1961                    -5.4262 republican     
    -16.2420        1981 …
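
A note on reading this output, offered as a sketch rather than a diagnosis: `clf.coef_[0]` holds the log-probabilities of each word for a single class, so the left column likely lists words that never occur in that class (the repeated -16.2420 would be the smoothed floor) and the right column lists that class's most frequent words. Neither column says much about what separates the classes. One option is to rank words by the difference in log-probability between two classes; scikit-learn exposes the per-class values as `feature_log_prob_`. The plain-Python version below shows the idea with toy counts. The class names, counts, and helper names are invented for illustration.

```python
import math

def log_probs(counts, vocab, alpha=1.0):
    """Laplace-smoothed log P(word | class), the quantity multinomial NB estimates."""
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return {w: math.log((counts.get(w, 0) + alpha) / total) for w in vocab}

def most_discriminative(counts_a, counts_b, n=2, alpha=1.0):
    """Rank words by log P(w | A) - log P(w | B).

    Returns (top n words for class B, top n words for class A).
    """
    vocab = sorted(set(counts_a) | set(counts_b))
    lp_a = log_probs(counts_a, vocab, alpha)
    lp_b = log_probs(counts_b, vocab, alpha)
    ranked = sorted(vocab, key=lambda w: lp_a[w] - lp_b[w])
    return ranked[:n], ranked[::-1][:n]

# Toy word counts per class, invented for illustration only.
politics = {"said": 50, "obama": 30, "senate": 20, "1961": 1}
sports = {"said": 45, "game": 40, "team": 25, "1961": 1}

top_sports, top_politics = most_discriminative(politics, sports)
print(top_politics, top_sports)
```

A word such as "said" is frequent in both toy classes, so it ranks near the middle here even though it would top a list sorted by a single class's log-probability.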


python classification machine-learning scikit-learn text-classification

Ali*_*ice

2017 05-23

8
votes

1
answer

5299
views

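WordNet: Iterating over synsets

For a project, I would like to measure the number of "human-centered" words in a text. I plan to use WordNet for this. I have never used it, and I am not sure how to approach the task. I want to use WordNet to count the number of words that belong to certain synsets, for example the synsets "human" and "person".

I came up with the following (simple) snippet: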

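from nltk.corpus import wordnet as wn

word = 'girlfriend'
word_synsets = wn.synsets(word)[0]  # first sense only

hypernyms = word_synsets.hypernym_paths()[0]  # first path, from the root down to the word

for element in hypernyms:
    print(element)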


The result is:

Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('causal_agent.n.01')
Synset('person.n.01')
Synset('friend.n.01')
Synset('girlfriend.n.01')


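My first question is: how do I properly iterate over the hypernyms? In the code above, they print just fine. However, when using an "if" statement, for example: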

count_humancenteredness = 0
for element in hypernyms:
    if element == 'person':
        print('found person hypernym')
        count_humancenteredness += 1


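I get "AttributeError: 'str' object has no attribute '_name'". What approach can I use to iterate over a word's hypernyms and perform an action (for example, increment the human-centeredness count) when the word does belong to the "person" or "human" synsets?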
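
One way around the error, sketched under the assumption of NLTK 3 (where `Synset.name()` is a method): the elements of a hypernym path are `Synset` objects, so comparing them with a plain string appears to be what triggers the AttributeError. Comparing synset names such as `'person.n.01'` avoids that. The helper names below are hypothetical, and the sketch keeps the first-sense choice from the snippet above while checking every hypernym path instead of only the first.

```python
def path_contains(path, target_names):
    """True if any synset in a hypernym path has one of the target names."""
    return any(synset.name() in target_names for synset in path)

def is_human_centered(word, target_names=frozenset({"person.n.01"})):
    """Check the first noun sense of `word` against the target synset names."""
    # Imported here so the helper above works without NLTK installed.
    # Requires NLTK and its WordNet corpus.
    from nltk.corpus import wordnet as wn

    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return False  # word has no noun entry in WordNet
    return any(path_contains(p, target_names) for p in synsets[0].hypernym_paths())
```

With that in place, the loop in the question reduces to `if is_human_centered(word): count_humancenteredness += 1`. Whether the first sense is the right one for a given text is a separate modelling choice.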

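Second, is this an efficient approach? I assume that iterating over several texts and over the hypernyms of every noun will take quite some time. Perhaps there is another way to use WordNet to do this more efficiently.

Thanks for your help!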
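
On the efficiency point, one possible approach is to look each distinct word up only once and reuse the answer across texts, since the same words recur constantly. The sketch below wraps any word-level predicate in a cache; `make_counter` and the toy predicate are hypothetical stand-ins, and a real WordNet lookup would take the predicate's place.

```python
from collections import Counter
from functools import lru_cache

def make_counter(is_human):
    """Build a token counter that calls is_human at most once per distinct word."""
    cached = lru_cache(maxsize=None)(is_human)

    def count(tokens):
        # Collapse repeated tokens first, then weight each hit by its frequency.
        return sum(n for word, n in Counter(tokens).items() if cached(word))

    return count

# Toy predicate standing in for a WordNet-based check (assumption for illustration).
def toy_is_human(word):
    return word in {"girlfriend", "friend", "president"}

count = make_counter(toy_is_human)
print(count(["girlfriend", "house", "girlfriend"]))
print(count(["president", "house", "senate"]))
```

The cache lives as long as the returned `count` function, so passing every text through the same counter is what gives the saving.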

python nltk wordnet

Ali*_*ice

2015 04-15

5
votes

1
answer

1984
views