Arv*_*eer 7 lda gensim scikit-learn
Here, best_lda_model is an sklearn-based LDA model, and we are trying to compute a coherence score for it.
coherence_model_lda = CoherenceModel(model = best_lda_model,texts=data_vectorized, dictionary=dictionary,coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\n Coherence Score :',coherence_lda)
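Output: I get the error below because I am trying to compute a coherence score for an sklearn LDA topic model. Is there a way around it? Also, what metric does sklearn's LDA use to group these words together?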
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in _get_topics_from_model(model, topn)
490 matutils.argsort(topic, topn=topn, reverse=True) for topic in
--> 491 model.get_topics()
492 ]
AttributeError: 'LatentDirichletAllocation' object has no attribute 'get_topics'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-106-ce8558d82330> in <module>
----> 1 coherence_model_lda = CoherenceModel(model = best_lda_model,texts=data_vectorized, dictionary=dictionary,coherence='c_v')
2 coherence_lda = coherence_model_lda.get_coherence()
3 print('\n Coherence Score :',coherence_lda)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in __init__(self, model, topics, texts, corpus, dictionary, window_size, keyed_vectors, coherence, topn, processes)
210 self._accumulator = None
211 self._topics = None
--> 212 self.topics = topics
213
214 self.processes = processes if processes >= 1 else max(1, mp.cpu_count() - 1)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in topics(self, topics)
433 self.model)
434 elif self.model is not None:
--> 435 new_topics = self._get_topics()
436 logger.debug("Setting topics to those of the model: %s", self.model)
437 else:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in _get_topics(self)
467 def _get_topics(self):
468 """Internal helper function to return topics from a trained topic model."""
--> 469 return self._get_topics_from_model(self.model, self.topn)
470
471 @staticmethod
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in _get_topics_from_model(model, topn)
493 except AttributeError:
494 raise ValueError(
--> 495 "This topic model is not currently supported. Supported topic models"
496 " should implement the `get_topics` method.")
497
ValueError: This topic model is not currently supported. Supported topic models should implement the `get_topics` method.
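You can use tmtoolkit to compute each of the four coherence scores that gensim's CoherenceModel provides. According to its documentation, the function tmtoolkit.topicmod.evaluate.metric_coherence_gensim "also supports models from lda and sklearn (by passing topic_word_distrib, dtm and vocab)!".
So, to get the 'c_v' coherence measure: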
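import numpy as np
from tmtoolkit.topicmod.evaluate import metric_coherence_gensim

# lda_model - fitted LatentDirichletAllocation()
# vect - fitted CountVectorizer()
# dtm_tf - document-term matrix returned by vect.fit_transform(...)
# texts - the list of tokenized documents
# vocab must follow the column order of dtm_tf; iterating over
# vect.vocabulary_.keys() does not guarantee that order (scikit-learn >= 1.0).
vocab = np.array(vect.get_feature_names_out())
metric_coherence_gensim(measure='c_v',
                        top_n=25,
                        topic_word_distrib=lda_model.components_,
                        dtm=dtm_tf,
                        vocab=vocab,
                        texts=train['cleaned_NOUN'].values)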
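If you would rather not add another dependency, there may be a second route. The traceback in the question shows that `CoherenceModel.__init__` accepts a `topics` argument alongside `model`, so the top words of each topic can be extracted by hand and passed in directly. Below is a minimal sketch of that extraction step only. The helper name `top_words_per_topic` is hypothetical, and the small matrix and vocabulary are made-up stand-ins for `lda_model.components_` and the vectorizer's feature names.

```python
from typing import List, Sequence


def top_words_per_topic(topic_word: Sequence[Sequence[float]],
                        vocab: Sequence[str],
                        topn: int = 10) -> List[List[str]]:
    """Return the `topn` highest-weighted words for each topic row."""
    topics = []
    for weights in topic_word:
        if len(weights) != len(vocab):
            raise ValueError("each topic row needs one weight per vocabulary term")
        # Rank column indices by descending weight.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([vocab[i] for i in ranked[:topn]])
    return topics


# Toy stand-ins (assumed values, not taken from a real model).
vocab = ["cat", "dog", "fish", "stock", "bond"]
topic_word = [
    [5.0, 4.0, 3.0, 0.1, 0.2],  # an "animals"-like topic
    [0.1, 0.3, 0.2, 6.0, 5.5],  # a "finance"-like topic
]

topics = top_words_per_topic(topic_word, vocab, topn=3)
print(topics)
```

The resulting list of word lists is the shape you would hand to `CoherenceModel` as `topics=`, together with `texts`, `dictionary` and `coherence='c_v'`, leaving `model` unset. Note that `texts` has to be the tokenized documents, not the vectorized matrix used in the question.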
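As for the second part of the question: as far as I know, perplexity (which often does not agree with human judgment) is the native evaluation method of sklearn's LDA implementation.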
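To make that number a bit more concrete, here is a rough sketch of the definition the scikit-learn documentation gives for perplexity, exp(-1 × log-likelihood per word). The function name and the toy distributions are illustrative assumptions. scikit-learn itself works from an approximate variational bound rather than the exact mixture likelihood computed here, so this will not reproduce the value of `lda_model.perplexity(...)`.

```python
import math
from typing import Sequence


def mixture_perplexity(doc_term_counts: Sequence[Sequence[int]],
                       doc_topic: Sequence[Sequence[float]],
                       topic_word: Sequence[Sequence[float]]) -> float:
    """exp(-log-likelihood per word) under p(w|d) = sum_k theta[d][k] * beta[k][w]."""
    total_log_lik = 0.0
    total_words = 0
    for counts, theta in zip(doc_term_counts, doc_topic):
        for w, n in enumerate(counts):
            if n == 0:
                continue  # words that do not occur contribute nothing
            p_w = sum(theta[k] * topic_word[k][w] for k in range(len(theta)))
            total_log_lik += n * math.log(p_w)
            total_words += n
    return math.exp(-total_log_lik / total_words)


# Toy case 1: every word equally likely, so the model is as "confused"
# as a uniform guess over the 4-word vocabulary.
uniform = mixture_perplexity(
    doc_term_counts=[[1, 1, 1, 1]],
    doc_topic=[[0.5, 0.5]],
    topic_word=[[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],
)

# Toy case 2: the document uses a single topic that spreads its mass
# over only two words, so perplexity drops.
focused = mixture_perplexity(
    doc_term_counts=[[2, 2, 0, 0]],
    doc_topic=[[1.0, 0.0]],
    topic_word=[[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
)

print(uniform, focused)
```

Lower is better, and `topic_word` must hold proper probability rows here, so `components_` would need to be row-normalized before being used this way.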