I'm new to Python and I need to build an LDA project. After some preprocessing steps, this is my code:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None

temp = dictionary[0]  # only to force the dictionary to build its id2token mapping
id2word = dictionary.id2token

model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize,
                 alpha='auto', eta='auto',
                 random_state=42,
                 iterations=iterations, num_topics=num_topics,
                 passes=passes, eval_every=eval_every)
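For reference, this is just a quick sketch of how the learned topics themselves can be printed (not the part I am asking about; the actual words depend on my data):

# Quick look at the topics: each entry is (topic_id, "weight*word + weight*word + ...").
for topic_id, topic_string in model.print_topics(num_topics=num_topics, num_words=10):
    print(topic_id, topic_string)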
I want to get the topic distribution for every document, i.e. the 10 topic probabilities per document, but when I use:
get_document_topics = model.get_document_topics(corpus)
print(get_document_topics)
the output is only:
<gensim.interfaces.TransformedCorpus object at 0x000001DF28708E10>
How can I get the topic distribution for each document?
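What I am hoping to end up with is one list of (topic_id, probability) pairs per document, roughly like the sketch below (I am assuming the returned object can simply be iterated over, but I am not sure that is the intended usage):

# Hoped-for usage (sketch): one (topic_id, probability) list per document.
doc_topics = model.get_document_topics(corpus, minimum_probability=0.0)
for i, topics in enumerate(doc_topics):
    print(i, topics)  # e.g. [(0, 0.01), (1, 0.32), ..., (9, 0.05)]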
python-3.x lda gensim topic-modeling probability-distribution
Here is the rest of my code:
import pandas as pd

data = pd.read_csv('asscsv2.csv', encoding="ISO-8859-1", error_bad_lines=False)
data_text = data[['content']]
data_text['index'] = data_text.index
documents = data_text
which looks like this:
print(documents[:2])
                                             content  index
0  Pretty extensive background in Egyptology and ...      0
1  Have you guys checked the back end of the Sphi...      1
I defined a preprocessing function using gensim and NLTK:
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
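For example, on a made-up sentence the function returns a list of stemmed tokens (a sketch; the exact stems come from NLTK and may differ):

# Example call on a made-up string: stopwords and tokens of length <= 3 are dropped,
# the rest are lemmatized as verbs and then stemmed.
print(preprocess("Checking the topic modelling pipeline with gensim"))
# something like ['check', 'topic', 'model', 'pipelin', 'gensim']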
But when I apply this function:
processed_docs = documents['content'].map(preprocess)
it raises:
TypeError: decoding to …