LDA model generates different topics every time I train on the same corpus

alv*_*vas 15 python nlp lda gensim topic-modeling

I'm training a Latent Dirichlet Allocation (LDA) model with Python's gensim on a small corpus of 231 sentences. However, every time I repeat the process, it generates different topics.

Why do the same LDA parameters and corpus generate different topics every time?

How can I stabilize topic generation?

I'm using this corpus (http://pastebin.com/WptkKVF0) and this stop word list (http://pastebin.com/LL7dqLcj), and here is my code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import defaultdict
import codecs, os, glob, math

stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i.strip() and not i.startswith("#")]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)

    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))

    return lda, corpus_lda, top_clusters, top_wordonly

####################################################################### 

# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
             for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)

for i in topic_wordonly:
    print i

Fre*_*Foo 30

Why do the same LDA parameters and corpus generate different topics every time?

Because LDA uses randomness in both the training and the inference step.

How can I stabilize topic generation?

By resetting the numpy.random seed to the same value before each model training/inference run, using numpy.random.seed:

import numpy as np

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)

(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)

  • If the training data is sufficient, the results should converge within a finite number of iterations, shouldn't they? (2 upvotes)
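A minimal sketch (numpy only, no gensim) of why reseeding works: resetting the global RNG replays the exact same sequence of draws, and that sequence is the only source of randomness older gensim versions consume during training and inference.

```python
import numpy as np

SOME_FIXED_SEED = 42

# Reseed, then draw: the sequence is fully determined by the seed.
np.random.seed(SOME_FIXED_SEED)
first_run = np.random.rand(5)

# Reseed with the same value: the exact same draws come back.
np.random.seed(SOME_FIXED_SEED)
second_run = np.random.rand(5)

assert (first_run == second_run).all()
```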

Vai*_*sai 6

Set the random_state parameter when initializing LdaModel().

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=1,
                                            passes=num_passes,
                                            alpha='auto')