ble*_*man 2 python lda gensim topic-modeling
另一个线程有一个与我类似的问题,但遗漏了可重现的代码。
有问题的脚本的目标是创建一个尽可能具有内存效率的进程。所以我尝试编写一个类corpus()来利用 gensims 的功能。但是,我遇到了一个 IndexError,我不确定在创建lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics)).
我使用的文档与 gensim 教程中使用的文档相同,我将其放入 tutorial_example.txt:
$ cat tutorial_example.txt
Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey
Run Code Online (Sandbox Code Playgroud)
$./gensim_topic_modeling.py -mn2 -w'english' -l1 tutorial_example.txt
Traceback (most recent call last):
File "./gensim_topic_modeling.py", line 98, in <module>
lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics))
File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 306, in __init__
self.update(corpus)
File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 543, in update
self.log_perplexity(chunk, total_docs=lencorpus)
File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 454, in log_perplexity
perwordbound = self.bound(chunk, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 630, in bound
gammad, _ = self.inference([doc])
File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 366, in inference
expElogbetad = self.expElogbeta[:, ids]
IndexError: index 7 is out of bounds for axis 1 with size 7
Run Code Online (Sandbox Code Playgroud)
下面是gensim_topic_modeling.py脚本:
##gensim_topic_modeling.py
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import re
import codecs
import logging
import fileinput
from operator import *
from itertools import *
from sklearn.cluster import KMeans
from gensim import corpora, models, similarities, matutils
import argparse
from nltk.corpus import stopwords
reload(sys)
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stdin = codecs.getreader('utf-8')(sys.stdin)
##defs
def stop_word_gen():
nltk_langs=['danish', 'dutch', 'english', 'french', 'german', 'italian','norwegian', 'portuguese', 'russian', 'spanish', 'swedish']
stoplist = []
for lang in options.stop_langs.split(","):
if lang not in nltk_langs:
sys.stderr.write('\n'+"Language {0} not supported".format(lang)+'\n')
continue
stoplist.extend(stopwords.words(lang))
return stoplist
def clean_texts(texts):
# remove tokens that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
return [[word for word in text if word not in tokens_once] for text in texts]
##class
class corpus(object):
"""sparse vector matrix and dictionary"""
def __iter__(self):
first=True
for line in fileinput.FileInput(options.input, openhook=fileinput.hook_encoded("utf-8")):
# assume there's one document per line; tokenizer option determines how to split
if options.space_tokenizer:
rl = re.compile('\s+', re.UNICODE).split(unicode(line,'utf-8'))
else:
rl = re.compile('\W+', re.UNICODE).split(tagRE.sub(' ',line))
# create dictionary
tokens=[token.strip().lower() for token in rl if token != '' and token.strip().lower() not in stoplist]
if first:
first=False
self.dictionary=corpora.Dictionary([tokens])
else:
self.dictionary.add_documents([tokens])
self.dictionary.compactify
yield self.dictionary.doc2bow(tokens)
##main
if __name__ == '__main__':
##parser
parser = argparse.ArgumentParser(
description="Topic model from a column of text. Each line is a document in the corpus")
parser.add_argument("input", metavar="args")
parser.add_argument("-l", "--document-frequency-limit", dest="doc_freq_limit", default=1,
help="Remove all tokens less than or equal to limit (default 1)")
parser.add_argument("-m", "--create-model", dest="create_model", default=False, action="store_true",
help="Create and save a model from existing dictionary and input corpus.")
parser.add_argument("-n", "--number-of-topics", dest="number_of_topics", default=2,
help="Number of topics (default 2)")
parser.add_argument("-t", "--space-tokenizer", dest="space_tokenizer", default=False, action="store_true",
help="Use alternate whitespace tokenizer")
parser.add_argument("-w", "--stop-word-languages", dest="stop_langs", default="danish,dutch,english,french,german,italian,norwegian,portuguese,russian,spanish,swedish",
help="Desired languages for stopword lists")
options = parser.parse_args()
##globals
stoplist=set(stop_word_gen())
tagRE = re.compile(r'<.*?>', re.UNICODE) # Remove xml/html tags
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO, filename="topic-modeling-log")
logr = logging.getLogger("topic_model")
logr.info("#"*15 + " started " + "#"*15)
##instance of class
checker=corpus()
logr.info("#"*15 + " SPARSE MATRIX (pre-filter)" + "#"*15)
##view sparse matrix and dictionary
for vector in checker:
logr.info(vector)
logr.info("#"*15 + " DICTIONARY (pre-filter)" + "#"*15)
logr.info(checker.dictionary)
logr.info(checker.dictionary.token2id)
#filter
checker.dictionary.filter_extremes(no_below=int(options.doc_freq_limit)+1)
logr.info("#"*15 + " DICTIONARY (post-filter)" + "#"*15)
logr.info(checker.dictionary)
logr.info(checker.dictionary.token2id)
##Create lda model
if options.create_model:
tfidf = models.TfidfModel(checker,normalize=False)
print tfidf
logr.info("#"*15 + " corpus_tfidf " + "#"*15)
corpus_tfidf = tfidf[checker]
logr.info("#"*15 + " lda " + "#"*15)
lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics))
logr.info("#"*15 + " corpus_lda " + "#"*15)
corpus_lda = lda[corpus_tfidf]
##Evaluate topics based on threshold
scores = list(chain(*[[score for topic,score in topic] \
for topic in [doc for doc in corpus_lda]]))
threshold = sum(scores)/len(scores)
print "threshold:",threshold
print
cluster1 = [j for i,j in zip(corpus_lda,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(corpus_lda,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(corpus_lda,documents) if i[2][1] > threshold]
Run Code Online (Sandbox Code Playgroud)
结果topic-modeling-log文件如下。在此先感谢您的帮助!
2014-05-25 02:58:50,482 : INFO : ############### started ###############
2014-05-25 02:58:50,483 : INFO : ############### SPARSE MATRIX (pre-filter)###############
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,483 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,483 : INFO : [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,483 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,483 : INFO : [(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(4, 1), (10, 1), (12, 1), (13, 1), (14, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(24, 1), (26, 1), (27, 1), (28, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(24, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(9, 1), (26, 1), (30, 1)]
2014-05-25 02:58:50,485 : INFO : ############### DICTIONARY (pre-filter)###############
2014-05-25 02:58:50,485 : INFO : Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,485 : INFO : {'minors': 30, 'generation': 22, 'testing': 16, 'iv': 29, 'engineering': 15, 'computer': 2, 'relation': 20, 'human': 3, 'measurement': 18, 'unordered': 25, 'binary': 21, 'abc': 0, 'ordering': 31, 'graph': 26, 'system': 10, 'machine': 6, 'quasi': 32, 'random': 23, 'paths': 28, 'error': 17, 'trees': 24, 'lab': 5, 'applications': 1, 'management': 14, 'user': 12, 'interface': 4, 'intersection': 27, 'response': 8, 'perceived': 19, 'widths': 34, 'well': 33, 'eps': 13, 'survey': 9, 'time': 11, 'opinion': 7}
2014-05-25 02:58:50,486 : INFO : keeping 12 tokens which were in no less than 2 and no more than 4 (=50.0%) documents
2014-05-25 02:58:50,486 : INFO : resulting dictionary: Dictionary(12 unique tokens: ['minors', 'graph', 'system', 'trees', 'eps']...)
2014-05-25 02:58:50,486 : INFO : ############### DICTIONARY (post-filter)###############
2014-05-25 02:58:50,486 : INFO : Dictionary(12 unique tokens: ['minors', 'graph', 'system', 'trees', 'eps']...)
2014-05-25 02:58:50,486 : INFO : {'minors': 0, 'graph': 1, 'system': 2, 'trees': 3, 'eps': 4, 'computer': 5, 'survey': 6, 'user': 7, 'human': 8, 'time': 9, 'interface': 10, 'response': 11}
2014-05-25 02:58:50,486 : INFO : collecting document frequencies
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,486 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,486 : INFO : PROGRESS: processing document #0
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,486 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,488 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,488 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,488 : INFO : calculating IDF weights for 9 documents and 34 features (51 matrix non-zeros)
2014-05-25 02:58:50,488 : INFO : ############### corpus_tfidf ###############
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,488 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,489 : INFO : ############### lda ###############
2014-05-25 02:58:50,489 : INFO : using symmetric alpha at 0.5
2014-05-25 02:58:50,489 : INFO : using serial LDA version on this node
2014-05-25 02:58:50,489 : WARNING : input corpus stream has no len(); counting documents
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,489 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,489 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,491 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,491 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,491 : INFO : running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50 with a convergence threshold of 0
2014-05-25 02:58:50,491 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,491 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'mac
这是由于使用的语料库和字典没有相同的 id-to-word 映射造成的。如果您修改字典并dictionary.compactify()在错误的时间调用,则可能会发生这种情况。
举个简单的例子就清楚了。让我们做一个字典:
from gensim.corpora.dictionary import Dictionary
documents = [
['here', 'is', 'one', 'document'],
['here', 'is', 'another', 'document'],
]
dictionary = Dictionary()
dictionary.add_documents(documents)
Run Code Online (Sandbox Code Playgroud)
这本词典现在有这些词的条目并将它们映射到整数 id。将文档转换为(id, count)元组向量很有用(我们希望在将它们传递到模型之前进行):
vectorized_corpus = [dictionary.doc2bow(doc) for doc in corpus]
Run Code Online (Sandbox Code Playgroud)
有时你会想改变你的字典。例如,您可能想要删除非常罕见或非常常见的词:
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)
dictionary.compactify()
Run Code Online (Sandbox Code Playgroud)
删除单词会在字典中产生空白,但调用会dictionary.compactify()重新分配 id 以填补空白。但这意味着我们vectorized_corpus上面的 id 不再使用相同的 id dictionary,如果我们将它们传递到模型中,我们将得到一个IndexError.
解决方案:在进行更改和调用后使用字典制作您的向量表示dictionary.compactify()!
| 归档时间: |
|
| 查看次数: |
2046 次 |
| 最近记录: |