vis*_*akh 26 python dirichlet gensim topic-modeling
I am trying to understand how the gensim package in Python implements Latent Dirichlet Allocation. I am doing the following:
Define the dataset:
documents = ["Apple is releasing a new product",
"Amazon sells many things",
"Microsoft announces Nokia acquisition"]
After removing stopwords, I create the dictionary and the corpus:
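import gensim
from gensim import corpora

# `stoplist` is the set of stopwords to drop; it is defined elsewhere (not shown here).
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)  # maps each unique token to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vector per document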
Then I define the LDA model:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, update_every=1, chunksize=10000, passes=1)
Then I print the topics:
>>> lda.print_topics(5)
['0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product', '0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new', '0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is', '0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new', '0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft']
2013-12-03 13:26:21,878 : INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product
2013-12-03 13:26:21,880 : INFO : topic #1: 0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new
2013-12-03 13:26:21,880 : INFO : topic #2: 0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is
2013-12-03 13:26:21,881 : INFO : topic #3: 0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new
2013-12-03 13:26:21,881 : INFO : topic #4: 0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft
>>>
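I cannot make sense of this result. Does it give the probability of occurrence of each word? Also, what do topic #1, topic #2, etc. mean? I was expecting something more or less like the most important keywords.
I have already checked the gensim tutorial, but it did not really help much.
Thanks.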
Uts*_*v T 17
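I think this tutorial will help you understand everything very clearly - https://www.youtube.com/watch?v=DDq3OVp9dNA
I also had a hard time understanding it at first. I will briefly outline a few points.
In Latent Dirichlet Allocation,
Imagine the process of creating a document to be something like this:
LDA works backwards along this line: given a bag of words that represents a document, what topics could it be representing?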
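To make the "forward" direction concrete, here is a minimal sketch of the generative story in plain Python. It is not gensim's implementation; the two topics, the `alpha` values, and the helper names `sample_dirichlet` and `generate_document` are all made up for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate a bag of words by following the LDA generative story."""
    # 1. Choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topics)))
    words = []
    for _ in range(length):
        # 2. Draw a topic for this word position.
        k = rng.choices(topic_ids, weights=theta)[0]
        # 3. Draw a word from that topic's distribution over words.
        vocab = list(topics[k].keys())
        probs = list(topics[k].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics (invented for this example, not learned by gensim).
topics = [
    {"amazon": 0.4, "sells": 0.3, "things": 0.3},
    {"microsoft": 0.4, "nokia": 0.3, "acquisition": 0.3},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)   # the document's topic mixture
print(words)   # the generated bag of words
```

Training an LDA model goes the other way: it only sees bags of words like `words` and tries to recover plausible values for `topics` and `theta`.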
So, in your case, the first topic (0)
INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product
is more about things, amazon and many, since they have a higher proportion, and not so much about microsoft or apple, which have significantly lower values.
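If what you want is a short list of the most important keywords per topic, one rough way is to parse the printed string and keep only the words whose weight stands out. The sketch below assumes the exact `weight*word + weight*word` format shown in the question; `parse_topic` is a hypothetical helper and the 0.1 cutoff is an arbitrary choice.

```python
def parse_topic(topic_string):
    """Turn '0.181*things + 0.181*amazon' into [('things', 0.181), ('amazon', 0.181)]."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((word.strip(), float(weight)))
    return pairs

# Topic #0 exactly as printed in the question.
topic0 = ("0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + "
          "0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + "
          "0.031*acquisition + 0.031*product")

pairs = parse_topic(topic0)

# Keep only the words whose weight clearly stands out (arbitrary threshold).
keywords = [word for word, weight in pairs if weight > 0.1]
print(keywords)

# Total weight of the ten words that were printed for this topic.
shown_weight = sum(weight for _, weight in pairs)
print(round(shown_weight, 3))
```

The ten printed weights add up to a bit less than 1, which fits reading each number as the probability of that word within the topic, with the remainder going to the vocabulary words that were not printed.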
I would suggest reading this blog post for a better understanding (Edwin Chen is a genius!) - http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/