Tag: topic-modeling

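Inference with Labeled LDA/pLDA [Topic Modeling Toolbox]

I have been trying to write inference code for a trained Labeled LDA model and for pLDA using the TMT toolbox (Stanford NLP group). I have gone through the examples provided at the following links: http://nlp.stanford.edu/software/tmt/tmt-0.3/ http://nlp.stanford.edu/software/tmt/tmt-0.4/

This is the code I am trying to use for Labeled LDA inference: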

val modelPath = file("llda-cvb0-59ea15c7-31-61406081-75faccf7");

val model = LoadCVB0LabeledLDA(modelPath);

val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(model.tokenizer.get)      // tokenize with model's tokenizer
}

val labels = {
  source ~>                              // read from the source file
  Column(2) ~>                           // take column two, the year
  TokenizeWith(WhitespaceTokenizer())    // split the label field on whitespace
}

val outputPath = …

nlp scala stanford-nlp lda topic-modeling

Roh*_*ain

2012 08-01

3 votes

1 answer

1697 views

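Gensim LDA - default number of iterations

I would like to know the default number of iterations in gensim's LDA (Latent Dirichlet Allocation) algorithm. I don't think the documentation discusses this. (The number of iterations is set by the iterations parameter when initializing LdaModel.) Thanks!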

python gensim topic-modeling

Uts*_*v T

2015 06-10

3 votes

1 answer

5666 views

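Extracting the topic-word probability matrix from a gensim LdaModel

I have an LDA model and the document-topic probabilities.

from gensim.models import LdaModel

# build the model on the corpus
ldam = LdaModel(corpus=corpus, num_topics=20, id2word=dictionary)
# get the document-topic probabilities
theta, _ = ldam.inference(corpus)

I also need the word distributions for all topics, i.e. the topic-word probability matrix. Is there a way to extract this information?

Thanks!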
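To make the target concrete: a topic-word probability matrix has one row per topic and one column per vocabulary term, and each row sums to 1. In gensim, `LdaModel.get_topics()` returns such a matrix (assuming a gensim version that provides this method). The sketch below is not gensim code; it only illustrates the row normalization that turns unnormalized topic-word weights into that matrix, and the numbers are made up.

```python
def topic_word_matrix(topic_word_weights):
    """Row-normalize unnormalized topic-word weights into probabilities."""
    matrix = []
    for weights in topic_word_weights:
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


# Hypothetical weights: 2 topics over a 3-term vocabulary.
toy_weights = [
    [2.0, 1.0, 1.0],
    [0.5, 0.5, 3.0],
]
phi = topic_word_matrix(toy_weights)
for topic_id, row in enumerate(phi):
    print(topic_id, row)
```

With the model from the question, the equivalent call would be `ldam.get_topics()`, whose result has shape (num_topics, vocabulary size).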

python lda gensim topic-modeling

Clo*_*ave

lucky-day

3 votes

1 answer

2188 views

Negative values: evaluating gensim LDA with topic coherence

I am currently trying to evaluate my topic model with gensim's CoherenceModel:

from gensim.models.coherencemodel import CoherenceModel

cm_u_mass = CoherenceModel(model=model1, corpus=corpus1, coherence='u_mass')
coherence_u_mass = cm_u_mass.get_coherence()
print('\nCoherence Score: ', coherence_u_mass)

The output is always negative. Is that correct? Can anyone provide a formula or explain how u_mass works?
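For orientation, the UMass measure is usually written as a sum over pairs of a topic's top words, C = Σ_{i>j} log((D(w_i, w_j) + 1) / D(w_j)), where D counts the documents containing the given word or word pair. In a realistic corpus the pair count is well below the single-word count, so each ratio is below 1 and each log term is negative; scores closer to zero indicate a more coherent topic. The sketch below implements that textbook formula on a made-up corpus. It is not gensim's code, and gensim's implementation may differ in details such as the smoothing constant and how pair scores are aggregated.

```python
import math


def u_mass(top_words, documents):
    """Textbook UMass coherence for one topic's top words (most probable first)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score


# Hypothetical tokenized corpus, for illustration only.
toy_docs = [
    ["cat", "dog"], ["cat"], ["cat"], ["cat", "pet"],
    ["dog"], ["dog"], ["pet"],
]
score = u_mass(["cat", "dog", "pet"], toy_docs)
print(score)
```

Every top word must occur in at least one document, otherwise the denominator is zero.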

evaluation python-3.x gensim topic-modeling

Nil*_*ter

lucky-day

3 votes

1 answer

2997 views

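Topic inference on a new corpus with sklearn LatentDirichletAllocation

I have been using the sklearn.decomposition.LatentDirichletAllocation module to explore a corpus of documents. After many iterations of training and tuning the model (i.e. adding stop words and synonyms, varying the number of topics), I am fairly happy and familiar with the distilled topics. As a next step, I would like to apply the trained model to a new corpus.

Is it possible to apply the fitted model to a new set of documents to determine their topic distributions?

I know this is possible in the gensim library, where you can train a model:

from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
lda = LdaModel(common_corpus, num_topics=10)

and then apply the trained model to a new corpus:

topic_distributions = lda[unseen_doc]

Source: https://radimrehurek.com/gensim/models/ldamodel.html

How can I do this with scikit-learn's implementation of LDA?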
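A minimal sketch of one way this might look in scikit-learn, assuming the new documents are vectorized with the same fitted CountVectorizer that was used for training. The toy documents and variable names are made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and unseen documents.
train_docs = [
    "cats and dogs are popular pets",
    "dogs chase cats in the garden",
    "the stock market fell sharply today",
    "investors trade shares on the stock market",
]
new_docs = [
    "my dog sleeps in the garden",
    "shares and stock prices rose",
]

# Fit the vocabulary and the topic model on the training corpus only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_train)

# Reuse the fitted vectorizer (transform, not fit_transform), then infer topics.
X_new = vectorizer.transform(new_docs)
new_doc_topics = lda.transform(X_new)  # one row per new document
print(new_doc_topics)
```

Words in the new documents that were not in the training vocabulary are ignored by the vectorizer, so a very different new corpus may yield uninformative distributions.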

python lda topic-modeling scikit-learn

J. *_*amp

2018 08-02

3 votes

1 answer

1267 views

How to fix a runtime error when computing the LDA model coherence score?

text='Alice is a student. She likes studying. Teachers are giving a lot of homework.'

I am trying to get topics with a coherence score from a simple text (like the one above). This is my LDA model:

from pprint import pprint

import gensim
from gensim import corpora
from gensim.models import CoherenceModel

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=5, random_state=100, update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True)
# Print the keywords of each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

When I try to run this coherence model:

coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

I expect this kind of output -> Coherence Score: 0.532947587081

Instead I get this error:

raise RuntimeError('''
RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your child processes and you have forgotten to …
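The error text above is the one Python's multiprocessing module raises when child processes are started while the main module is still being imported, which happens on platforms that spawn child processes rather than fork them. A minimal sketch of the usual remedy, a main-module guard, follows. It uses only the standard library; `square` and `main` are hypothetical stand-ins, and it assumes the coherence computation starts worker processes internally.

```python
import multiprocessing


def square(x):
    """Stand-in for the per-chunk work a worker process would do."""
    return x * x


def main():
    # Code that starts child processes lives inside a function ...
    with multiprocessing.Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])


if __name__ == "__main__":
    # ... and is only triggered when the file is run as a script, so a child
    # process that re-imports this module does not start workers of its own.
    print(main())
```

Applied to the script in the question, this would mean moving the CoherenceModel construction and the get_coherence() call into a function that is called under such a guard.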

python nlp runtime-error lda topic-modeling

Xen*_*ena

2020 05-18

3 votes

1 answer

1868 views

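A practical example of GSDMM in Python?

I want to use GSDMM to assign topics to some tweets in my dataset. The only examples I have found (1 and 2) are not detailed enough. I was wondering whether you know of a source (or care enough to write a small example) that shows how GSDMM is implemented in Python.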
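As a rough illustration, here is a compact pure-Python sketch of the collapsed Gibbs sampler for the Dirichlet multinomial mixture (the GSDMM idea: each short text belongs to exactly one cluster). It is a teaching sketch, not a reference implementation; the function name, default hyperparameters, and toy tweets are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def gsdmm(docs, n_clusters=8, alpha=0.1, beta=0.1, n_iters=30, seed=0):
    """Assign one cluster label to each tokenized document."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    docs_per_cluster = [0] * n_clusters                   # m_z
    words_per_cluster = [0] * n_clusters                  # n_z
    word_counts = [Counter() for _ in range(n_clusters)]  # n_z^w

    def move(doc, z, sign):
        # Add (sign=+1) or remove (sign=-1) a document from cluster z.
        docs_per_cluster[z] += sign
        words_per_cluster[z] += sign * len(doc)
        for w in doc:
            word_counts[z][w] += sign

    def log_score(doc, z):
        # Unnormalized log-probability of assigning doc to cluster z.
        score = math.log(docs_per_cluster[z] + alpha)
        for w, count in Counter(doc).items():
            for j in range(count):
                score += math.log(word_counts[z][w] + beta + j)
        for i in range(len(doc)):
            score -= math.log(words_per_cluster[z] + vocab_size * beta + i)
        return score

    # Random initial assignment.
    labels = []
    for doc in docs:
        z = rng.randrange(n_clusters)
        labels.append(z)
        move(doc, z, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            move(doc, labels[d], -1)  # take the document out of its cluster
            scores = [log_score(doc, z) for z in range(n_clusters)]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            labels[d] = rng.choices(range(n_clusters), weights=weights)[0]
            move(doc, labels[d], +1)  # put it into the sampled cluster
    return labels


# Hypothetical pre-tokenized tweets.
tweets = [
    ["cat", "dog"], ["dog", "cat", "pet"],
    ["stock", "market"], ["market", "stock", "trade"],
]
print(gsdmm(tweets, n_clusters=3, n_iters=20, seed=1))
```

The number of clusters passed in is an upper bound: clusters can empty out during sampling, so the number of labels actually used may be smaller.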

python tweets lda topic-modeling

Pie*_*ton

lucky-day

3 votes

1 answer

3150 views

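Stop word removal and stemming/lemmatization for BERTopic

For topic modeling, I am trying out BERTopic: link

I am a bit confused here. I am trying BERTopic on my custom dataset.
Since BERT is trained in a way that preserves the semantics of a text/document, should I remove stop words and stem/lemmatize the documents before passing them to BERTopic? I am worried that these stop words may end up in my topics as salient terms even though they are not.

Please share your suggestions and advice!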

python nlp topic-modeling bert-language-model

War*_*ckQ

lucky-day

3 votes

1 answer

6313 views

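How to compute the probability of each document under each topic with BERTopic?

I am trying to analyze the topic distribution of documents with BERTopic. After running BERTopic, I would like to compute each document's probability under each topic. How should I do that?

from bertopic import BERTopic

# define model
model = BERTopic(verbose=True, vectorizer_model=vectorizer_model, embedding_model='paraphrase-MiniLM-L3-v2', min_topic_size=50, nr_topics=10)
# train model
headline_topics, _ = model.fit_transform(df1.review_processed3)
# examine one of the topics
freq = model.get_topic_info()    # not in the original snippet; assumed source of freq
a_topic = freq.iloc[0]["Topic"]  # select the 1st topic
model.get_topic(a_topic)         # show the words and their c-TF-IDF scores

Below are the words and c-TF-IDF scores for one of the topics (image 1).

How should I change the result into a topic distribution like the one below, so that I can compute the topic distribution scores and identify the dominant topic? (image 2)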
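One way to think about the last step, as a sketch: if a per-document topic probability matrix is available (BERTopic can return one as the second value of `fit_transform` when the model is created with `calculate_probabilities=True`, assuming the installed version supports that option), the dominant topic of each document is the column with the largest value in its row. The helper name and the toy matrix below are made up for illustration.

```python
def dominant_topics(doc_topic_probs):
    """Return (topic index, probability) of the most probable topic per document."""
    results = []
    for probs in doc_topic_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((best, probs[best]))
    return results


# Hypothetical probabilities: 2 documents over 3 topics.
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
print(dominant_topics(toy_probs))
```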

python nlp topic-modeling bert-language-model

qwe*_*u13

2022 05-23

3 votes

1 answer

1998 views

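LDAvis HTML output from serVis is blank

I am trying LDAvis for the first time and have run into the following problem.

After running serVis on my JSON object,

serVis(json, out.dir = 'LDAvis', open.browser = FALSE)

the 5 expected files are created (i.e. d3.v3.js, index.html, lda.css, lda.json, and ldavis.js). As I understand LDAvis, opening the HTML file should open the interactive viewer. However, doing so only opens a blank web page.

I compared the HTML source with the LDAvis projects found online, and they are identical. This was built using the script by Christopher Gandrud found here, with the LDA results coming from the topicmodels package using the Gibbs method. The underlying data has ~45K documents and about 15K unique terms. For what it's worth, the lda.json file seems a bit small at ~6MB.

Unfortunately, this problem seems too large to provide sample data or reproducible code. (If I can isolate the problem further, then perhaps I can add example code.) Instead, I am hoping readers have some idea of the cause, or whether it has come up before.

Thanks in advance for any feedback!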
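One possible cause, offered as an assumption rather than a confirmed diagnosis: the page loads lda.json through a script request, and some browsers block such requests for pages opened directly from disk. Serving the output folder over a local HTTP server sidesteps that. The sketch below uses only Python's standard library; the function names are hypothetical, and the folder name and port are taken from or chosen for the example.

```python
import functools
import http.server


def make_static_server(directory, port=8000):
    """Build (but do not start) an HTTP server that serves `directory` on localhost."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


def serve(directory="LDAvis", port=8000):
    """Serve the folder until interrupted with Ctrl+C."""
    with make_static_server(directory, port) as httpd:
        print(f"Open http://127.0.0.1:{httpd.server_address[1]}/index.html")
        httpd.serve_forever()
```

Calling `serve("LDAvis")` from the directory that contains the output folder, then opening the printed address, would show whether the blank page is a local-file restriction or a problem with the generated files themselves.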

r lda topic-modeling

cdx*_*sza

lucky-day

2 votes

1 answer

1423 views

Tag statistics

topic-modeling ×10

python ×7

lda ×6

nlp ×4

gensim ×3

bert-language-model ×2

evaluation ×1

python-3.x ×1

r ×1

runtime-error ×1

scala ×1

scikit-learn ×1

stanford-nlp ×1

tweets ×1
