我试图检查BOW语料库与LDA [BOW语料库]的内容(由在该语料库上训练的LDA模型转换,例如35个主题)我发现了以下输出:
DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]
LDA 1 : [(29, 0.80571428571428572)]
DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]
LDA 2 : [(29, 0.83809523809523812)]
DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]
LDA 3 : [(34, 0.75714285714285712)]
DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]
LDA 4 : [(22, 0.50714288283121989), (32, 0.25714283145449457)]
DOC 5 : [(440, 1), (533, 1), (1264, 1), (2433, 1), (3012, 1), (3902, …Run Code Online (Sandbox Code Playgroud) 我正在使用 gensim 的 LDA 来执行主题建模。我知道如何将原始文本数据转换为语料库并获取主题。但是,在获得主题后,我可以将主题结果标记或添加回原始文档吗?
这是我的代码:
movie_reviews = pd.read_csv(data_path + 'movie_review.tsv',header=0,delimiter='\t',quoting=3)
reviews = []
for i in range(len(movie_reviews['review'])):
reviews.append(review_to_words(movie_reviews['review']
[i],stops=stopwords.words('english')))
from gensim import corpora
dictionary = corpora.Dictionary(reviews)
corpus = [dictionary.doc2bow(review) for review in reviews]
from gensim import models
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=10)
corpus_lda = lda[corpus_tfidf]
lda.print_topics(10)
[(0,
u'0.001*ben + 0.001*sinatra + 0.001*santa + 0.001*henry + 0.001*band + 0.001*william + 0.001*fool + 0.001*tragic + 0.001*favourite + 0.001*bed'),
(1,
u'0.002*dentist + 0.002*homeless + 0.002*connery + …Run Code Online (Sandbox Code Playgroud)