在Gensim LDA中记录主题分布

Question

在Gensim LDA中记录主题分布

我使用玩具语料库得出了一个LDA主题模型,如下所示:

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)

id2word = {}
for word in dictionary.token2id:    
    id2word[dictionary.token2id[word]] = word

Run Code Online (Sandbox Code Playgroud)

我发现当我使用少量主题来推导模型时,Gensim会生成一份关于测试文档所有潜在主题的主题分布的完整报告.例如:

test_lda = LdaModel(corpus,num_topics=5, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]

Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]

Run Code Online (Sandbox Code Playgroud)

但是,当我使用大量主题时,报告不再完整:

test_lda = LdaModel(corpus,num_topics=100, id2word=id2word)

test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]

Run Code Online (Sandbox Code Playgroud)

在我看来,输出中省略了概率小于某个阈值的主题(我观察到0.01更具体).

我想知道这种行为是否是由于某些美学考虑因素造成的？如何在所有其他主题上获得概率质量残差的分布？

谢谢你的回答!

Answer 1

Mos*_* Xu 8

阅读源,结果发现概率小于阈值的主题被忽略.此阈值的默认值为0.01.

Answer 2

Mat*_*yra 8

我意识到这是一个老问题,但万一有人偶然发现它,这里有一个解决方案(问题实际上已经在当前开发分支中修复了一个minimum_probability参数,LdaModel但也许你正在运行旧版本的gensim).

定义一个新函数(这只是从源代码复制)

def get_doc_topics(lda, bow):
    gamma, _ = lda.inference([bow])
    topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    return [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist)]

Run Code Online (Sandbox Code Playgroud)

上述函数不会根据概率过滤输出主题,但会输出所有这些主题.如果你不需要(topic_id, value)元组而只需要值,只需返回topic_dist而不是列表理解(它也会更快).

归档时间：	12 年，4 月前
查看次数：	10187 次
最近记录：	10 年，1 月前