How to interpret LDA components (using sklearn)?

Question

LSz*_*LSz 12 python-3.x lda topic-modeling scikit-learn

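I used Latent Dirichlet Allocation (the sklearn implementation) to analyse about 500 scientific article abstracts, and I got topics containing the most important words (in German). My problem is interpreting the values associated with those words. I assumed that the probabilities of all words in each topic would add up to 1, but that is not the case.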

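How can I interpret these values? For example, I would like to be able to say why the words in topic #20 have much higher values than those in the other topics. Does their absolute magnitude have something to do with Bayesian probability? Is that topic more common in the corpus? I have not yet been able to connect these values with the math behind LDA.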

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

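# stop_ger (a German stop-word list) and stemmer_sklearn are the asker's own objects, not shown here
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words=stop_ger,
                                analyzer='word',
                                tokenizer=stemmer_sklearn.stem_ger())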

tf = tf_vectorizer.fit_transform(texts)

n_topics = 10
# The keyword is n_components in current scikit-learn (older releases called it n_topics)
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50., random_state=0)

lda.fit(tf)

def print_top_words(model, feature_names, n_top_words):
    # For each topic, print the n_top_words features with the largest component values
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic Nr.%d:' % int(topic_id + 1))
        top = topic.argsort()[:-n_top_words - 1:-1]  # indices of the largest values, descending
        print(''.join(feature_names[i] + ' ' + str(round(topic[i], 2)) + ' | '
                      for i in top))

n_top_words = 4
tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn releases
print_top_words(lda, tf_feature_names, n_top_words)

Topic Nr.1: demenzforsch 1.31 | fotus 1.21 | umwelteinfluss 1.16 | forschungsergebnis 1.04 |
Topic Nr.2: fur 1.47 | zwisch 0.94 | uber 0.81 | kontext 0.8 |
...
Topic Nr.20: werd 405.12 | fur 399.62 | sozial 212.31 | beitrag 177.95 |


Answer 1

Sim*_*dal 6

From the documentation:

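components_: Variational parameters for topic word distribution. Since the complete conditional for topic word distribution is a Dirichlet, components_[i, j] can be viewed as pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as distribution over the words for each topic after normalization: model.components_ / model.components_.sum(axis=1)[:, np.newaxis].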

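So if you normalize the components, the values can be read as a distribution that tells you how important each term is within a topic. AFAIU you cannot use the pseudocounts to compare the importance of two topics in the corpus, because they are a smoothing factor applied to the term-topic distribution.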
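A minimal sketch of the normalization quoted from the documentation, using a small invented matrix in place of lda.components_ (the numbers are hypothetical and do not come from the question's model). It only illustrates that rows with very different pseudocount magnitudes each sum to 1 once normalized:

```python
import numpy as np

# Hypothetical pseudocounts standing in for lda.components_
# (2 topics x 3 words); the values are invented for illustration.
components = np.array([
    [1.0, 1.0, 2.0],        # a topic with small pseudocounts
    [400.0, 200.0, 200.0],  # a topic with large pseudocounts
])

# Divide every row by its own sum, as in the documentation's formula,
# so that each topic becomes a distribution over the words.
topic_word = components / components.sum(axis=1)[:, np.newaxis]

print(topic_word)
print(topic_word.sum(axis=1))  # every row now sums to 1
```

With a fitted model, the same expression would be applied to lda.components_ directly, and the resulting values can then be compared between words inside one topic.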
