I am using Latent Dirichlet Allocation (the sklearn implementation) to analyze about 500 scientific article abstracts, and I get topics containing the most important words (in German). My problem is interpreting the values attached to those most important words. I assumed that the probabilities of all words within a topic sum to 1, but that is not the case.
How can I interpret these values? For example, I would like to be able to explain why the word values of topic #20 are much higher than those of the other topics. Is their absolute magnitude related to a Bayesian probability? Is that topic simply more common in the corpus? I have not yet managed to connect these values with the math behind LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words=stop_ger,
                                analyzer='word',
                                tokenizer=stemmer_sklearn.stem_ger())
tf = tf_vectorizer.fit_transform(texts)
n_topics = 10
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50., random_state=0)
lda.fit(tf)
def print_top_words(model, feature_names, n_top_words):
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic Nr.%d:' % int(topic_id + 1))
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
                       + ' | ' for i in topic.argsort()[:-n_top_words - 1:-1]]))
n_top_words = 4
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
Topic …
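To make the observation concrete, this is roughly how I checked the per-topic sums, and what I tried in order to turn the rows into comparable distributions (a minimal sketch; lda, tf and tf_feature_names come from the code above, and the normalization step and the lda.transform call are only my own guesses at how to read the values, not something I know to be the intended interpretation):

import numpy as np

# Row sums of the topic-word matrix (lda.components_): these are not 1,
# which is what puzzled me in the first place
row_sums = lda.components_.sum(axis=1)
print(row_sums)

# Attempted normalization: divide each topic row by its sum so that the
# values within every topic add up to 1 and can be compared across topics
topic_word = lda.components_ / row_sums[:, np.newaxis]
print(topic_word.sum(axis=1))  # now all 1.0

# To check whether a topic is simply more frequent in the corpus, I look at
# the average document-topic weights
doc_topic = lda.transform(tf)
print(doc_topic.mean(axis=0))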