Is a coherence score of 0.4 good or bad?

Question

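I need to know whether a coherence score of 0.4 is good or bad. I am using LDA as the topic modelling algorithm.

What is the average coherence score in this context?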

Answer 1

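Coherence measures the relative distance between the words within a topic. There are two major types: c_v, typically 0 < x < 1, and u_mass, typically -14 < x < 14. It is rare to see a coherence of 1 or above .9 unless the words being measured are identical words or bigrams. For example, "United" and "States" would likely return a coherence score of ~.94, and "hero" and "hero" would return a coherence of 1. The overall coherence score of a topic is the average of the distances between its words. If I'm using c_v, I try to reach .7 in my LDA models; I think that indicates strong topic correlation. I would say: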
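To make the "average over word pairs" idea concrete, here is a minimal sketch of a u_mass-style score computed from document co-occurrence counts. It is a simplified illustration, not Gensim's exact implementation; the function name `umass_style_coherence`, the smoothing constant `eps` and the toy documents are assumptions made up for this example.

```python
import math


def umass_style_coherence(topic_words, documents, eps=1.0):
    """Average log((D(w_i, w_j) + eps) / D(w_j)) over pairs of topic words.

    D(...) counts the documents that contain all of the given words.
    Simplified illustration only; real libraries add segmentation and
    windowing details that are omitted here.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            scores.append(math.log((doc_freq(w_i, w_j) + eps) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy corpus (hypothetical data): two pet documents, two finance documents.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "vet"],
    ["stock", "market", "trade"],
    ["stock", "price", "market"],
]
print(umass_style_coherence(["cat", "dog"], docs))    # words that co-occur
print(umass_style_coherence(["cat", "stock"], docs))  # words that never co-occur
```

Words that appear together in the same documents score higher than words that never co-occur, which is the behaviour a coherence measure is meant to capture.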

.3 is bad

.4 is low

.55 is okay

.65 might be as good as it is going to get

.7 is nice

.8 is unlikely, and

.9 is probably wrong

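Fixes for low coherence:

Adjust your parameters: alpha = .1, beta = .01 or .001, seed = 123, etc.
Get better data.
At .4 you probably have the wrong number of topics. See https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/ for what is known as the elbow method: it gives you a graph for finding the number of topics with the greatest coherence in your data set. I use MALLET, which gives pretty good coherence. Here is code to check the coherence for different numbers of topics: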

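from pprint import pprint

import gensim
import matplotlib.pyplot as plt
from gensim.models import CoherenceModel

# Assumes mallet_path (path to the MALLET binary), id2word, corpus and
# data_lemmatized have already been defined.
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):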
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Note: gensim.models.wrappers was removed in Gensim 4.0, so this needs gensim < 4.
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Can take a long time to run.
start, limit, step = 2, 40, 6
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=start, limit=limit, step=step)
# Show graph
x = range(start, limit, step)
plt.plot(x, coherence_values, label="coherence_values")
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(loc='best')
plt.show()

# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
    
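# Select the model and print the topics.
# Index 3 is the fourth model trained (num_topics = 20 for start=2, step=6);
# use the index that scores best on your own data.
optimal_model = model_list[3]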
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
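Instead of hard-coding the index of the chosen model, the best-scoring entry can be looked up from the list of coherence values. The sketch below assumes the same `start` and `step` as the loop above; the helper name `pick_best_num_topics` and the example scores are made up for illustration. It simply takes the maximum, which is not the same as reading an elbow off the plot, so looking at the curve is still worthwhile.

```python
def pick_best_num_topics(coherence_values, start, step):
    """Return (index, num_topics, score) of the highest coherence value.

    Assumes coherence_values[i] belongs to num_topics = start + i * step,
    as produced by iterating over range(start, limit, step).
    """
    if not coherence_values:
        raise ValueError("coherence_values is empty")
    best_index = max(range(len(coherence_values)), key=lambda i: coherence_values[i])
    return best_index, start + best_index * step, coherence_values[best_index]


# Hypothetical scores for num_topics = 2, 8, 14, 20, 26, 32.
example_scores = [0.35, 0.41, 0.47, 0.52, 0.50, 0.48]
best_index, best_k, best_score = pick_best_num_topics(example_scores, start=2, step=6)
print("Best index:", best_index, "num_topics:", best_k, "coherence:", best_score)
```

With the real results, `model_list[best_index]` would then be the model to inspect.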

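For the parameter suggestion above (alpha, beta, seed), this is a rough sketch of how those values could be passed to Gensim's own `LdaModel` rather than the MALLET wrapper. In Gensim the topic-word prior usually called beta is named `eta`, and the seed is `random_state`. The helper names `lda_kwargs` and `train_lda` are hypothetical, and the defaults just repeat the values suggested in this answer; they are starting points, not tuned settings.

```python
def lda_kwargs(num_topics, alpha=0.1, beta=0.01, seed=123):
    """Build keyword arguments for gensim.models.LdaModel.

    beta is passed as `eta` and the seed as `random_state`, which are
    the names LdaModel uses for these settings.
    """
    return {
        "num_topics": num_topics,
        "alpha": alpha,
        "eta": beta,
        "random_state": seed,
    }


def train_lda(corpus, id2word, **kwargs):
    """Train an LdaModel; gensim is imported here so the helper above has no dependencies."""
    from gensim.models import LdaModel

    return LdaModel(corpus=corpus, id2word=id2word, **kwargs)


# Example: the settings to try for a 20-topic model with the smaller beta.
print(lda_kwargs(20, beta=0.001))
```

A call such as `train_lda(corpus, id2word, **lda_kwargs(20))` would then train one model, whose coherence can be compared against the previous run.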
I hope this helps :)

Can you recommend a paper in which the scores and levels you give were established experimentally?