Tag: topic-modeling

Topic modeling: assigning each document its top 2 topics as category labels (sklearn Latent Dirichlet Allocation)

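I am working through the LDA (Latent Dirichlet Allocation) topic modeling method to extract topics from a set of documents. From what I understand from the link below, it is an unsupervised learning technique that categorizes/labels each document with the extracted topics.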

The sample code given in that link defines a function that gets the top words associated with each of the identified topics.

sklearn.__version__


Out[41]: '0.17'

from sklearn.decomposition import LatentDirichletAllocation 


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)


My question is this: is there any component or matrix of the fitted LDA model from which we can get the document-topic association?

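For example, I need to find the top 2 topics associated with each document and use them as that document's label/category. Is there a component that gives the distribution of topics within a document, analogous to model.components_, which gives the distribution of words within a topic?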
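One possible approach, offered as a sketch rather than a confirmed answer: LatentDirichletAllocation.transform returns a document-topic matrix with one row per document, and the two largest entries in each row give that document's top 2 topics. The helper name top_k_topics and the sample matrix below are made up for illustration. Rows are renormalized because sklearn versions before 0.18 return unnormalized values from transform.

```python
import numpy as np

def top_k_topics(doc_topic, k=2):
    """Return (indices, weights) of the k strongest topics per document.

    doc_topic is a (n_documents, n_topics) array, e.g. the output of
    lda.transform(tf) for a fitted sklearn LatentDirichletAllocation.
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    # Normalize each row so the weights read as a probability distribution.
    doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)
    # argsort is ascending, so reverse the columns and keep the first k.
    top_idx = np.argsort(doc_topic, axis=1)[:, ::-1][:, :k]
    top_weight = np.take_along_axis(doc_topic, top_idx, axis=1)
    return top_idx, top_weight

# Hypothetical document-topic matrix for 3 documents and 4 topics.
example = np.array([[0.10, 0.60, 0.20, 0.10],
                    [0.70, 0.05, 0.20, 0.05],
                    [2.00, 1.00, 4.00, 3.00]])  # unnormalized row
labels, weights = top_k_topics(example, k=2)
print(labels)
```

Each row of labels then holds the two topic indices that could serve as the document's category labels, with the matching weights available if a confidence threshold is wanted.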

python python-2.7 lda topic-modeling scikit-learn

Bal*_*ala

2015-12-23

9 votes

1 answer

3337 views

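Exporting a pyLDAvis graph as a standalone web page

I am analyzing text with topic modeling, using Gensim and pyLDAvis. I would like to share the results with remote colleagues without requiring them to install Python and all the required libraries. Is there a way to export the interactive graph as HTML/JS files that can be uploaded to any web server? I found something mentioned in the documentation, but I don't know how to implement it: https://github.com/bmabey/pyLDAvis/blob/master/pyLDAvis/_display.py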

python lda gensim topic-modeling

Dar*_*ius

9 votes

1 answer

6291 views

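How to get all the documents for each topic in BERTopic modeling

I have a dataset and am trying to convert it into topics using BERTopic modeling, but the problem is that I cannot get all the documents of a topic. BERTopic returns only 3 documents per topic.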

from bertopic import BERTopic

# embedding_model, topic_list and docs_test are not shown in the question
topic_model = BERTopic(verbose=True,
                       embedding_model=embedding_model,
                       nr_topics='auto',
                       n_gram_range=(3, 3),
                       top_n_words=10,
                       calculate_probabilities=True,
                       seed_topic_list=topic_list)
topics, probs = topic_model.fit_transform(docs_test)
representative_doc = topic_model.get_representative_docs(1)  # topic #1
representative_doc


The topic contains more than 300 documents, but .get_representative_docs shows only 3 of them.
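A possible workaround, sketched under the assumption that topics (the first value returned by fit_transform in the code above) holds one topic id per input document, in the same order as the documents: group the documents by that id. The helper name docs_by_topic and the sample data are hypothetical stand-ins.

```python
from collections import defaultdict

def docs_by_topic(docs, topics):
    """Group documents by their assigned topic id."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)

# Stand-in values; in the question these would be docs_test and the
# `topics` list returned by topic_model.fit_transform(docs_test).
docs = ["red wine", "white wine", "car engine", "misc text"]
topics = [0, 0, 1, -1]  # BERTopic uses -1 for outlier documents
grouped = docs_by_topic(docs, topics)
print(grouped[0])  # every document assigned to topic 0
```

Grouping on the per-document assignments sidesteps get_representative_docs, which is meant to return only a few exemplar documents per topic.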

nlp topic-modeling text-classification bert-language-model

Kal*_*eem

2023-02-17

9 votes

1 answer

6045 views

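Why does MALLET topic inference give different results for single and batch documents?

I am trying to perform LDA topic modeling with Mallet 2.0.7. Judging by the output of the training sessions, I can train an LDA model and get good results. I can also use the inferencer built in that process and get similar results when re-processing my training files. However, if I take an individual file from the larger training set and process it with the inferencer, I get very different results, which are not good.

My understanding is that the inferencer should use a fixed model and only features local to that document, so I don't understand why I would get any different results when processing 1 file or the 1k files from my training set. I am not doing frequency cutoffs, which would seem to be the kind of global operation that has this effect. You can see the other parameters I am using in the commands below, but they are mostly defaults. Changing the number of iterations to 0 or 100 did not help.

Import data: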

bin/mallet import-dir \
  --input trainingDataDir \
  --output train.data \
  --remove-stopwords TRUE \
  --keep-sequence TRUE \
  --gram-sizes 1,2 \
  --keep-sequence-bigrams TRUE


Train:

time ../bin/mallet train-topics \
  --input ../train.data \
  --inferencer-filename lda-inferencer-model.mallet \
  --num-top-words 50 \
  --num-topics 100 \
  --num-threads 3 \
  --num-iterations 100 \
  --doc-topics-threshold 0.1 \
  --output-topic-keys topic-keys.txt \
  --output-doc-topics doc-topics.txt


Topics assigned to one file during training; #14 in particular is about wine, which is correct:

998 file:/.../29708933509685249 14  0.31684981684981683 
> grep "^14\t" topic-keys.txt 
14  0.5 wine spray cooking car climate top wines place live honey sticking ice prevent collection market …


nlp machine-learning mallet topic-modeling

Joh*_*ann

2013-04-04

8 votes

1 answer

4879 views

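How to print out the full distribution of words in an LDA topic in gensim?

The lda.show_topics method in the following code only prints the distribution of the top 10 words for each topic. How do I print the full distribution of all the words in the corpus?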

from gensim import corpora, models

documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]

stoplist = …


python lda gensim topic-modeling

alv*_*vas

2015-08-15

8 votes

1 answer

5334 views

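Gensim LDA topic assignment

I want to use LDA to assign each document to a topic. I realize that what you get from LDA is a distribution over topics. However, as you can see from the last line below, I assign each document to its most probable topic.

My question is this: I have to run lda[corpus] a second time in order to get these topics. Is there some other built-in gensim function that would give me this topic assignment vector directly? Since the LDA algorithm has already passed through the documents, might it have saved these topic assignments?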

from gensim import corpora, models

# Get the Dictionary and BoW of the corpus after some stemming/cleansing
# (stem, STOPWORDS and cleanDF are not shown in the question)
texts = [[stem(word) for word in document.split() if word not in STOPWORDS] for document in cleanDF.text.values]
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.9)
corpus = [dictionary.doc2bow(text) for text in texts]

# The actual LDA component
lda = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=30, chunksize=10000, passes=10,workers=4) 

# Assign each document to most prevalent topic
lda_topic_assignment = [max(p,key=lambda item: item[1]) for p in lda[corpus]]


python lda gensim topic-modeling

sac*_*ruk

2019-09-21

8 votes

1 answer

997 views

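get_document_topics and get_term_topics in gensim

The ldamodel in gensim has two methods: get_document_topics and get_term_topics.

Although they are used in this gensim tutorial notebook, I do not fully understand how to interpret the output of get_term_topics, so I created the self-contained code below to show what I mean: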

from gensim import corpora, models

texts = [['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

# build the corpus, dict and train the model
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
model = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, 
                                 random_state=0, chunksize=2, passes=10)

# show the topics
topics …


python gensim topic-modeling

tkj*_*kja

2017-04-22

8 votes

1 answer

10k views

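How to access only the topic words in gensim

I built an LDA model using Gensim and I want to get only the topic words. How can I get just the words, with no probabilities and no IDs?

I tried the print_topics() and show_topics() functions in gensim, but I could not get clean words!

This is the code I used: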

import gensim
from gensim import corpora

dictionary = corpora.Dictionary(doc_clean)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics=12, id2word = dictionary, passes = 100, alpha='auto', update_every=5)
x = ldamodel.print_topics(num_topics=12, num_words=5)
for i in x:
    print(i[1])
    #print('\n' + str(i))

0.045*???? + 0.045*??????? + 0.045*??????? + 0.045*?????? + 0.045*?????
0.021*??? + 0.021*??????????? + 0.021*???? + 0.021*???? + 0.021*???????
0.068*???????? + 0.068*???????? + 0.068*????????? + 0.068*????? + 0.005*????
0.033*????? + …
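
One way to strip the weights, as a sketch only: each string printed above has the form weight*word + weight*word + ..., so it can be split on " + " and "*". The helper name words_only and the sample string are invented for illustration; the quote handling is an assumption about newer gensim versions, which wrap each word in double quotes.

```python
def words_only(topic_string):
    """Return just the words from a 'weight*word + weight*word' string."""
    terms = topic_string.split(" + ")
    # Drop everything up to the first '*', then any surrounding quotes.
    return [term.split("*", 1)[1].strip().strip('"') for term in terms]

# Made-up string in the same shape as the print_topics() output above.
sample = '0.045*"wine" + 0.045*"spray" + 0.021*cooking'
print(words_only(sample))
```

Inside the loop from the question this would be called as print(words_only(i[1])).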


python nlp lda gensim topic-modeling

Muh*_*akh

8 votes

2 answers

4744 views

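LDA topic model performance: topic coherence implementation for scikit-learn

I have a question about measuring/calculating topic coherence for LDA models built in scikit-learn.

Topic coherence is a useful metric for measuring the human interpretability of a given LDA topic model. Gensim's CoherenceModel allows topic coherence to be calculated for a given LDA model (several variants are included).

I am interested in using scikit-learn's LDA rather than gensim's LDA for ease of use and documentation (note: I would like to avoid the gensim-to-scikit-learn wrappers, i.e. I want to actually use sklearn's LDA). From my research, there seems to be no scikit-learn equivalent to Gensim's CoherenceModel.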

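Is there a way to either:

1 - Feed scikit-learn's LDA model into gensim's CoherenceModel pipeline, either by manually converting the scikit-learn model into gensim format or through a scikit-learn-to-gensim wrapper (I have only seen the wrapper the other way around), to generate topic coherence?

Or

2 - Manually calculate topic coherence from scikit-learn's LDA model and the CountVectorizer/Tfidf matrices?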

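I have done quite a bit of research online on implementations for this use case but have not seen any solutions. My only leads are the equations documented in the scientific literature.

If anyone knows of a similar implementation, or could point me in the right direction for creating a manual method for this, that would be great. Thank you!

*Side note: I know that perplexity and log-likelihood are available in scikit-learn for performance measurement, but from what I have read these are not as predictive.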
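
For option 2, a minimal sketch of one coherence measure from the literature, the UMass score: for each topic it sums log((D(w_m, w_l) + 1) / D(w_l)) over pairs of the topic's top words, where D counts the documents containing the word or word pair. It needs only a dense document-term count matrix (for example tf.toarray() from a CountVectorizer) and the topic-word weights (for example lda.components_). The function name umass_coherence and the toy inputs are assumptions for illustration, and the code assumes every top word occurs in at least one document.

```python
import numpy as np

def umass_coherence(doc_term, topic_word, top_n=10):
    """Return one UMass coherence score per topic (closer to 0 is better)."""
    # 1.0 where a term occurs in a document at least once, else 0.0.
    present = (np.asarray(doc_term) > 0).astype(float)
    scores = []
    for weights in np.asarray(topic_word, dtype=float):
        # Indices of the top_n highest-weighted terms, strongest first.
        top = np.argsort(weights)[::-1][:top_n]
        sub = present[:, top]
        # co[i, j] = number of documents containing both term i and term j;
        # the diagonal holds each term's document frequency.
        co = sub.T @ sub
        score = 0.0
        for m in range(1, len(top)):
            for l in range(m):
                score += np.log((co[m, l] + 1.0) / co[l, l])
        scores.append(float(score))
    return scores

# Toy inputs: 3 documents x 3 terms, and a single topic.
doc_term = np.array([[1, 1, 0],
                     [1, 0, 1],
                     [1, 1, 0]])
topic_word = np.array([[0.5, 0.3, 0.2]])
print(umass_coherence(doc_term, topic_word, top_n=3))
```

This is not the same computation as every variant in Gensim's CoherenceModel, so scores from the two should not be compared directly.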

nlp lda gensim topic-modeling scikit-learn

lea*_*guy

2018-08-31

8 votes

1 answer

6599 views

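How to speed up topic models in R?

Background: I am trying to fit a topic model with the following data and specification: documents = 140 000, words = 3000, topics = 15. I am using the topicmodels package in R (3.1.2) on a Windows 7 machine (24 GB RAM, 8 cores). My problem is that the computation just goes on and on without producing any "convergence".

I am using the default options of the LDA() function in topicmodels:

Run the model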

dtm2.sparse_TM <- LDA(dtm2.sparse, 15)


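The model has been running for about 72 hours now, and is still running as I write this.

Question: So my questions are (a) is this normal behaviour; (b) if the answer to the first question is no, do you have any suggestions on what to do; (c) if the answer to the first question is yes, how can I substantially improve the speed of the computation?

Additional information: The original data contains not 3000 words but about 3.7 million. When I ran that (on the same machine) it did not converge, not even after a couple of weeks. So I ran it with 300 words and only 500 documents (randomly selected), and that did not work well either. For all models I used the same number of topics and the default values as before.

So for my current model (see my question) I removed sparse terms with the help of the tm package.

Remove sparse terms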

dtm2.sparse <- removeSparseTerms(dtm2, 0.9)


Thanks in advance for your input, Adel

r machine-learning lda unsupervised-learning topic-modeling

Ade*_*del

2016-01-04

7 votes

1 answer

1873 views

Tag statistics

topic-modeling ×10

lda ×7

gensim ×6

python ×6

nlp ×4

machine-learning ×2

scikit-learn ×2

bert-language-model ×1

mallet ×1

python-2.7 ×1

r ×1

text-classification ×1

unsupervised-learning ×1
