Hierarchical Dirichlet Process in Gensim: number of topics is independent of corpus size

Question


I am using the Gensim HDP module on a set of documents.

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17


Why is the number of topics independent of the corpus length?

Answer 1

Rok*_*jic 10

Because of changes in the gensim API, the code in @Aaron's answer no longer works. I rewrote and simplified it as follows. Works as of June 2017 with gensim v2.1.0.

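import pandas as pd

def topic_prob_extractor(gensim_hdp):
    # Each entry is (topic_id, [(word, probability), ...])
    shown_topics = gensim_hdp.show_topics(num_topics=-1, formatted=False)
    topics_nos = [topic_id for topic_id, _ in shown_topics]
    # Sum the returned word probabilities of each topic as a rough weight.
    # Iterating directly avoids indexing the list by topic id.
    weights = [sum(prob for _, prob in words) for _, words in shown_topics]

    return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})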


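I tried this and got an empty DataFrame. Unfortunately I can't post the data because it is proprietary. Has anyone seen this? (3 upvotes)
Just saw this. Thanks for the refactor, @Roko (2 upvotes)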

Answer 2

Raf*_*afs 8

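The approaches of @Aaron and @Roko Mijic overlook the fact that show_topics returns only the top 20 words of each topic by default. If all the words that make up a topic were returned, the approximated probability of every topic would be 1 (or 0.999999). I experimented with the following code, which is an adaptation of @Roko Mijic's: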

import pandas as pd

def topic_prob_extractor(gensim_hdp, t=-1, w=25, isSorted=True):
    """
    Input the gensim model to get the rough topics' probabilities
    """
    # Each entry is (topic_id, [(word, probability), ...]) with the top w words
    shown_topics = gensim_hdp.show_topics(num_topics=t, num_words=w, formatted=False)
    topics_nos = [topic_id for topic_id, _ in shown_topics]
    weights = [sum(prob for _, prob in words) for _, words in shown_topics]
    df = pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
    if isSorted:
        return df.sort_values(by="weight", ascending=False)
    return df

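To make the point about truncated word lists concrete, here is a minimal sketch in plain Python. The six-word topic and the helper `top_n_mass` are made up for illustration and are not part of gensim; the sketch only shows that summing the top n word probabilities gives a partial mass, while summing every word always gives about 1.

```python
def top_n_mass(word_probs, n):
    """Sum the n largest probabilities of one topic's word distribution."""
    return sum(sorted(word_probs, reverse=True)[:n])

# Hypothetical topic over a 6-word vocabulary; the probabilities sum to 1.
toy_topic = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]

partial_mass = top_n_mass(toy_topic, 2)              # only the top 2 words
full_mass = top_n_mass(toy_topic, len(toy_topic))    # every word in the topic

print(partial_mass, full_mass)
```

The partial sum depends on how many words are requested, which is why the weights from these extractor functions are only a rough signal.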

A better approach, though I am not sure it is 100% valid, is the one mentioned here. You can get the true weights of the HDP model's topics (the alpha vector):

alpha = hdpModel.hdp_to_lda()[0]


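Inspecting the equivalent alpha values of the topics is more logical than summing the weights of each topic's top 20 words to approximate how likely the topic is to be used in the data.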
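
As a rough sketch of how such an alpha vector could be inspected, the snippet below ranks topics by their share of the total alpha. The helper `rank_topics_by_alpha` and the values in `toy_alpha` are hypothetical; normalizing to shares is an assumption made here for readability, not something the answer prescribes. With a trained model the input would be `hdpModel.hdp_to_lda()[0]`.

```python
def rank_topics_by_alpha(alpha):
    """Return (topic_id, share) pairs sorted by descending share of total alpha."""
    total = float(sum(alpha))
    shares = [(topic_id, a / total) for topic_id, a in enumerate(alpha)]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)

# Made-up alpha values for illustration only (not output from a real model).
toy_alpha = [0.02, 0.50, 0.01, 0.30, 0.17]
ranking = rank_topics_by_alpha(toy_alpha)

for topic_id, share in ranking:
    print(topic_id, round(share, 3))
```

Topics at the bottom of the ranking are the candidates for being effectively unused.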

Answer 3

Far*_* ET 5

There is apparently a bug in Gensim (version 3.8.3) where passing -1 to show_topics returns nothing at all. So I have tweaked the answers of Roko Mijic and Aaron.

import pandas as pd

def topic_prob_extractor(gensim_hdp):
    # Request m_T topics (the model's top-level truncation) instead of -1
    shown_topics = gensim_hdp.show_topics(num_topics=gensim_hdp.m_T, formatted=False)
    topics_nos = [topic_id for topic_id, _ in shown_topics]
    weights = [sum(prob for _, prob in words) for _, words in shown_topics]
    return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})


Answer 4

Aar*_*ron 4

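@user3907335 is exactly right here: HDP will compute as many topics as the assigned truncation level. However, many of those topics may have essentially zero probability of occurring. To help with this in my own work, I wrote a handy little function that roughly estimates the probability weight associated with each topic. Note that this is only a rough measure: it does not account for the probability associated with each word. Even so, it gives a good indication of which topics are meaningful and which are not: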

import pandas as pd
import numpy as np 

def topic_prob_extractor(hdp=None, topn=None):
    topic_list = hdp.show_topics(topics=-1, topn=topn)
    topics = [int(x.split(':')[0].split(' ')[1]) for x in topic_list]
    split_list = [x.split(' ') for x in topic_list]
    weights = []
    for lst in split_list:
        sub_list = []
        for entry in lst: 
            if '*' in entry: 
                sub_list.append(float(entry.split('*')[0]))
        weights.append(np.asarray(sub_list))
    sums = [np.sum(x) for x in weights]
    return pd.DataFrame({'topic_id' : topics, 'weight' : sums})


I assume you already know how to compute an HDP model. Once you have the hdp model computed by gensim, you call the function as follows:

topic_weights = topic_prob_extractor(hdp, 500)

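To illustrate why a truncated model always reports the same number of topics, here is a small pure-Python sketch of truncated stick-breaking. It is not gensim's implementation; the function name `truncated_stick_breaking` and the choice `gamma=1.0` are assumptions for illustration. The truncation of 150 matches, to my knowledge, the default `T=150` of gensim's `HdpModel`, which would explain the 150 topics in the question.

```python
import random

def truncated_stick_breaking(truncation, gamma, seed=0):
    """Draw `truncation` topic weights by breaking a unit-length stick.

    The list always has exactly `truncation` entries, however much data
    a model would later be fitted to.
    """
    rng = random.Random(seed)
    remaining = 1.0
    weights = []
    for k in range(truncation):
        # The last stick takes all remaining mass so the weights sum to 1.
        fraction = 1.0 if k == truncation - 1 else rng.betavariate(1.0, gamma)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights

weights = truncated_stick_breaking(truncation=150, gamma=1.0)
negligible = sum(1 for w in weights if w < 1e-6)
print(len(weights), negligible)
```

The number of weights is fixed by the truncation, and most of them carry almost no mass, which is the situation the weight-extraction functions in these answers are meant to expose.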
