How to use tf-idf with Naive Bayes?

Question

POO*_*PTA 11 tf-idf python-2.7 naivebayes

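I searched on this before posting here. Many links propose a solution, but none of those I explored explains exactly how it is done.

So here is my understanding of how tf-idf can be used with the Naive Bayes formula:

Naive Bayes formula: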

P(word|class)=(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_all_classes(basically vocabulary of words in the entire training set))


The tf-idf weights can be used in the formula above as follows:

word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.

total_words_in_class : sum of (tf-idf weights of all the words belonging to that class) 

total_unique_words_in_all_classes : as is.
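As a minimal sketch of the substitution described above, the following plain-Python code sums tf-idf weights per class and plugs them into the smoothed formula. The function names, the toy documents, the smoothed idf variant, and the choice to score a test document by its raw word occurrences are all assumptions made for illustration, not part of the original post.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term frequency times a smoothed idf (an assumed variant).
        vectors.append({w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1)
                        for w, c in counts.items()})
    return vectors


def train(docs, labels):
    """Estimate log P(class) and log P(word|class) from tf-idf weights."""
    vocab = {w for doc in docs for w in doc}
    # weight[c][w]: summed tf-idf weight of word w over documents of class c.
    weight = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(tfidf_vectors(docs), labels):
        for w, value in vec.items():
            weight[label][w] += value
    class_counts = Counter(labels)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # total_words_in_class + total_unique_words_in_all_classes
        denom = sum(weight[c].values()) + len(vocab)
        log_likelihood[c] = {w: math.log((weight[c].get(w, 0.0) + 1) / denom)
                             for w in vocab}
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior for a tokenized document."""
    scores = {}
    for c in log_prior:
        # Assumption: words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(log_likelihood[c][w]
                                       for w in doc if w in log_likelihood[c])
    return max(scores, key=scores.get)


# Toy data, made up for this example.
docs = [["cheap", "pills", "buy", "now"], ["buy", "cheap", "watches"],
        ["meeting", "agenda", "today"], ["project", "meeting", "notes"]]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_likelihood = train(docs, labels)
print(predict(["cheap", "pills"], log_prior, log_likelihood))
```

Working in log space avoids underflow when many small probabilities are multiplied.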


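This question has been posted on Stack Overflow several times, but so far none of the posts has received a substantive answer. I want to know whether the way I am thinking about the problem, i.e. the implementation described above, is correct. I need to know because I am implementing Naive Bayes myself, without the help of any Python library that has built-in functions for Naive Bayes and tf-idf. What I really want is to improve the accuracy (currently 30%) of the classifier I trained with Naive Bayes. So if there are better ways to reach good accuracy, suggestions are welcome.

Please advise. I am new to this field.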

Answer 1

jrh*_*e17 6

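It would be better if you had given us the exact features and classes you want to use, or at least an example. Since none were given, I will assume the following is your problem:

You have a number of documents, each containing a number of words.
You want to classify the documents into categories.
Your feature vector consists of all possible words across all documents, with the count of each word in each document as its value.

Your solution

The tf-idf scheme you gave is as follows: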

word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.

total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)


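Your approach sounds reasonable. For each class, the probabilities P(word|class) sum to 1 regardless of the tf-idf function, and the features reflect the tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.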
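The normalization claim can be checked directly. In the sketch below, the symbols are introduced here for illustration: $w_{t,c}$ is the summed tf-idf weight of word $t$ over the documents of class $c$, and $V$ is the vocabulary of the whole training set.

```latex
\sum_{t \in V} P(t \mid c)
  = \sum_{t \in V} \frac{w_{t,c} + 1}{\sum_{t' \in V} w_{t',c} + |V|}
  = \frac{\sum_{t \in V} w_{t,c} + |V|}{\sum_{t' \in V} w_{t',c} + |V|}
  = 1
```

The result only requires the weights to be non-negative, so it holds for any tf-idf variant.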

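Another possible solution

It took me a while to work through this problem. The main reason was a concern about keeping the probabilities normalized. Using Gaussian Naive Bayes sidesteps this issue entirely.

If you want to use this method:

Compute the mean and variance of the tf-idf values of each feature for each class.
Compute the class-conditional likelihood of each feature using the Gaussian distribution defined by that mean and variance.
Proceed as usual (multiply by the class prior) and predict the class.

Hard-coding this should not be too difficult, since the Gaussian density is straightforward to compute with numpy. I simply prefer this kind of general solution for problems like this.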
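The steps above can be sketched as follows. This version uses only the standard library so that it stays dependency-free; with numpy the per-feature loops would become array operations. The function names, the toy tf-idf rows, and the `var_floor` term (added so that a feature with zero variance in a class does not cause a division by zero) are assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-3):
    """Per class: log prior plus per-feature mean and variance of tf-idf values."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # var_floor is a hypothetical smoothing term, not from the original answer.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_gaussian_nb(model, row):
    """Return the class maximizing log prior + summed log likelihoods."""
    scores = {}
    for label, (log_prior, means, variances) in model.items():
        scores[label] = log_prior + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(scores, key=scores.get)


# Toy tf-idf rows (three features per document), made up for this example.
X = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.1], [0.1, 0.7, 0.6], [0.0, 0.9, 0.5]]
y = ["spam", "spam", "ham", "ham"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [0.85, 0.05, 0.05]))
```

Because tf-idf vectors are sparse, many features will have near-zero variance within a class, so the size of the variance floor can noticeably affect the result and is worth tuning.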

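Other ways to improve accuracy

In addition to the above, you can use the following techniques to improve accuracy:

Preprocessing:
1. Feature reduction (usually NMF, PCA, or LDA)
2. Additional features

Algorithm:
Naive Bayes is fast, but it tends to perform worse than other algorithms. It may be better to perform feature reduction and then switch to a discriminative model such as an SVM or logistic regression.

Miscellaneous:
Bootstrapping, boosting, etc. Be careful not to overfit, though.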
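The suggestion of feature reduction followed by a discriminative model could look like the sketch below. It assumes scikit-learn is available, which goes beyond the asker's from-scratch setup, and it swaps in `TruncatedSVD` for the reduction step because it works directly on sparse tf-idf matrices. The toy documents and the choice of two components are made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, made up for this example.
train_docs = [
    "buy cheap pills now",
    "cheap watches buy today",
    "cheap pills cheap watches",
    "project meeting agenda",
    "meeting notes for the project",
    "agenda and notes attached",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# tf-idf features -> low-rank projection -> logistic regression.
classifier = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)
classifier.fit(train_docs, train_labels)

predictions = classifier.predict(["cheap watches", "meeting notes"])
print(list(predictions))
```

On real data, the number of components and the regularization strength would need to be chosen by cross-validation, which also helps guard against the overfitting mentioned above.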

Hope this helps. Leave a comment if anything is unclear.
