How can I extract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in python-gensim?

Question

Raw LDA output

Any ideas?

Answer 1

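Given a dict called docs that maps each document to its list of words, you can turn each list into an array of words plus bigrams (or also trigrams, etc.) using nltk.util.ngrams or a function of your own: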

from nltk.util import ngrams

# Append the underscore-joined bigrams of each document to its own word list.
for doc in docs:
    docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]

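Then pass the values of this dict to the LDA model as the corpus (in gensim, after converting each token list to bag-of-words form). Bigrams joined by underscores are thus treated as single tokens.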
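As a minimal sketch of that last step, assuming gensim is the LDA implementation: gensim's `LdaModel` takes bag-of-words vectors rather than raw token lists, so the augmented documents first go through `Dictionary` and `doc2bow`. The helper names `add_bigrams` and `fit_lda` and the toy settings (`num_topics=2`, `passes=10`, `random_state=0`) are illustrative choices, not part of the original answer.

```python
def add_bigrams(tokens):
    """Return the tokens (a list) followed by their underscore-joined bigrams."""
    # zip(tokens, tokens[1:]) pairs each word with its successor,
    # the same pairs that nltk.util.ngrams(tokens, 2) yields.
    return tokens + ["_".join(pair) for pair in zip(tokens, tokens[1:])]


def fit_lda(docs, num_topics=2, passes=10):
    """Train a gensim LDA model on a dict of {doc_id: token list}."""
    # Imported here so add_bigrams stays usable without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    texts = [add_bigrams(tokens) for tokens in docs.values()]
    dictionary = Dictionary(texts)                          # token -> integer id
    corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors
    return LdaModel(corpus=corpus, id2word=dictionary,
                    num_topics=num_topics, passes=passes, random_state=0)
```

Calling `fit_lda(docs).print_topics()` would then list topics in which tokens such as `san_francisco` can appear next to plain unigrams.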

Answer 2

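You can use word2vec to get the most similar terms for the top n topics extracted with LDA.

LDA output

Create a dictionary of bigrams from the extracted topics (for example, san_francisco).

Then run word2vec to get the most similar words (unigrams, bigrams, etc.).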

Word (cosine distance)

los_angeles (0.666175)
golden_gate (0.571522)
oakland (0.557521)
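The scores in the list above are cosine similarities between word vectors: word2vec's distance tool labels the column "Cosine distance", but larger values mean closer terms. A minimal, dependency-free sketch of the measure follows. The function name `cosine_similarity` and the toy vectors are illustrative and are not actual word2vec embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for learned embeddings.
san_francisco = [1.0, 2.0, 0.0]
los_angeles = [2.0, 4.0, 0.0]   # same direction as san_francisco
unrelated = [0.0, 0.0, 3.0]     # orthogonal to both
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0, which is why a related term such as los_angeles ranks above less related ones.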

See https://code.google.com/p/word2vec/ (the section "From words to phrases and beyond").
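The two steps of this answer can be sketched roughly as follows, assuming gensim's `Word2Vec` class as a stand-in for the original C tool linked above. The helpers `merge_bigrams` and `similar_terms` and the hyperparameters (`vector_size=50`, `window=5`, `min_count=1`, `seed=0`) are hypothetical choices for illustration. The set of bigrams is assumed to have been collected by hand from the LDA topics; gensim's `gensim.models.phrases.Phrases` could be used to detect phrases automatically instead.

```python
def merge_bigrams(tokens, bigrams):
    """Join adjacent tokens that form a known bigram, e.g. san + francisco."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigrams:
            merged.append("_".join(pair))
            i += 2          # both words were consumed by the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def similar_terms(sentences, bigrams, query, topn=3):
    """Train Word2Vec on bigram-merged sentences and return neighbours of query."""
    # Imported lazily so merge_bigrams works without gensim installed.
    from gensim.models import Word2Vec

    merged = [merge_bigrams(s, bigrams) for s in sentences]
    # vector_size is the gensim 4.x parameter name; gensim 3.x called it size.
    model = Word2Vec(merged, vector_size=50, window=5, min_count=1, seed=0)
    return model.wv.most_similar(query, topn=topn)
```

With a large enough corpus, `similar_terms(sentences, {("san", "francisco")}, "san_francisco")` would return (term, score) pairs in the same form as the list above. The query term must occur in the merged sentences, and a toy corpus will not give meaningful neighbours.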