Using a scikit-learn vectorizer and vocabulary with gensim

emi*_*ara 19 python gensim topic-modeling scikit-learn

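I am trying to reuse scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first, I already have a great deal of vectorized data; second, I prefer the interface and flexibility of scikit-learn vectorizers; third, even though topic modelling with gensim is very fast, computing its dictionaries (Dictionary()) is relatively slow in my experience.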

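Similar questions have been asked before, especially here and here, and the bridging solution is gensim's Sparse2Corpus() function, which converts a Scipy sparse matrix into a gensim corpus object.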

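However, this conversion does not make use of the vocabulary_ attribute of sklearn vectorizers, which holds the mapping between words and feature ids. This mapping is necessary in order to print the discriminant words for each topic (id2word in gensim topic models, described as "a mapping from word ids (integers) to words (strings)").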
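To make the gap concrete, here is a minimal pure-Python sketch of the bag-of-words shape that gensim models consume: each document becomes a list of (feature id, count) pairs, so the words themselves are nowhere in the corpus. The helper rows_to_bow and the toy matrix are made up for illustration; neither gensim nor scikit-learn is imported, and dense lists stand in for the scipy sparse matrix.

```python
# Toy document-term count matrix: one row per document, one column per feature id.
# (Dense lists stand in for the scipy sparse matrix returned by fit_transform.)
counts = [
    [1, 0, 2],
    [0, 3, 0],
]


def rows_to_bow(matrix):
    """Hypothetical helper: turn each row into (feature_id, count) pairs,
    skipping zero entries."""
    return [[(col, value) for col, value in enumerate(row) if value != 0]
            for row in matrix]


corpus = rows_to_bow(counts)
print(corpus)  # [[(0, 1), (2, 2)], [(1, 3)]]
```

Only integer ids survive the conversion, which is why the topics printed further down show "21" and "31" instead of words.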

I am aware that gensim's Dictionary objects are much more complex (and slower to compute) than scikit's vect.vocabulary_ (a simple Python dict)...

Any ideas on how to use vect.vocabulary_ as id2word in gensim models?

Some example code:

# our data
documents = [u'Human machine interface for lab abc computer applications',
        u'A survey of user opinion of computer system response time',
        u'The EPS user interface management system',
        u'System and human system engineering testing of EPS',
        u'Relation of user perceived response time to error measurement',
        u'The generation of random binary unordered trees',
        u'The intersection graph of paths in trees',
        u'Graph minors IV Widths of trees and well quasi ordering',
        u'Graph minors A survey']

from sklearn.feature_extraction.text import CountVectorizer
# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)
# corpus_vect is a scipy sparse matrix with one row per document
print(vect.vocabulary_)
#{u'and': 1, u'minors': 20, u'generation': 9, u'testing': 32, u'iv': 15, u'engineering': 5, u'computer': 4, u'relation': 28, u'human': 11, u'measurement': 19, u'unordered': 37, u'binary': 3, u'abc': 0, u'for': 8, u'ordering': 23, u'graph': 10, u'system': 31, u'machine': 17, u'to': 35, u'quasi': 26, u'time': 34, u'random': 27, u'paths': 24, u'of': 21, u'trees': 36, u'applications': 2, u'management': 18, u'lab': 16, u'interface': 13, u'intersection': 14, u'response': 29, u'perceived': 25, u'in': 12, u'widths': 40, u'well': 39, u'eps': 6, u'survey': 30, u'error': 7, u'opinion': 22, u'the': 33, u'user': 38}

import gensim
# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
lsi = gensim.models.LsiModel(corpus_vect_gensim, num_topics=4)
# I instead would like something like this line below
# lsi = gensim.models.LsiModel(corpus_vect_gensim, id2word=vect.vocabulary_, num_topics=2)
print(lsi.print_topics(2))
#['0.622*"21" + 0.359*"31" + 0.256*"38" + 0.206*"29" + 0.206*"34" + 0.197*"36" + 0.170*"33" + 0.168*"1" + 0.158*"10" + 0.147*"4"', '0.399*"36" + 0.364*"10" + -0.295*"31" + 0.245*"20" + -0.226*"38" + 0.194*"26" + 0.194*"15" + 0.194*"39" + 0.194*"23" + 0.194*"40"']

Rad*_*dim 12

Gensim doesn't require Dictionary objects. You can use your plain dict as input to id2word directly, as long as it maps ids (integers) to words (strings).

In fact anything dict-like will do (including dict, Dictionary, SqliteDict...).

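(By the way, gensim's Dictionary is a simple Python dict underneath. Not sure where your remarks on Dictionary performance come from; you can't get a mapping much faster than a plain dict in Python. Maybe you're confusing it with text preprocessing (not part of gensim), which can indeed be slow.)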
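As a rough illustration of why a plain dict is enough, the sketch below re-renders the first three terms of topic 0 from the question using nothing but id lookups. render_topic is a hypothetical helper that mimics the printed topic format; it is not a gensim function, and gensim is not imported here. The three id-to-word pairs are taken from the vocabulary_ printed in the question.

```python
# Plain dict mapping feature ids (integers) to words (strings).
id2word = {21: "of", 31: "system", 38: "user"}

# First three terms of topic 0 as printed in the question: (feature id, weight).
topic_terms = [(21, 0.622), (31, 0.359), (38, 0.256)]


def render_topic(terms, mapping):
    """Hypothetical helper: format (id, weight) pairs with words instead of ids."""
    return " + ".join('%.3f*"%s"' % (weight, mapping[idx]) for idx, weight in terms)


readable = render_topic(topic_terms, id2word)
print(readable)  # 0.622*"of" + 0.359*"system" + 0.256*"user"
```

The only operation used on the mapping is mapping[idx], so any object that supports item lookup by integer id would serve equally well in this sketch.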


emi*_*ara 7

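To provide a final example, scikit-learn's vectorizer output can be converted into gensim's corpus format with Sparse2Corpus, while the vocabulary dict can be recycled by simply swapping keys and values: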

# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)

# transform scikit vocabulary into gensim dictionary
vocabulary_gensim = {}
for key, val in vect.vocabulary_.items():
    vocabulary_gensim[val] = key

  • Or just `id2word = dict((v, k) for k, v in vect.vocabulary_.iteritems())` (2 upvotes)
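A small sketch of the same swap with two sanity checks added, in case it helps. It uses a five-word subset of the vocabulary_ from the question and assumes, as is the case for the column indices of a fitted CountVectorizer, that ids are unique and run from 0 to n-1. It is written with .items(), which works on both Python 2 and Python 3.

```python
# Subset of the vect.vocabulary_ printed in the question (word -> feature id).
vocabulary = {"abc": 0, "and": 1, "applications": 2, "binary": 3, "computer": 4}

# Swap keys and values to get the id -> word mapping.
id2word = {idx: word for word, idx in vocabulary.items()}

# Sanity checks: no two words share an id, and the ids cover 0..n-1 with no gaps.
assert len(id2word) == len(vocabulary)
assert sorted(id2word) == list(range(len(vocabulary)))

print(id2word[4])  # computer
```

If the first check fails, two words were mapped to the same id and the swap would silently drop one of them.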

Jef*_*y04 6

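I am also running some code experiments using these two libraries. Apparently there is now a way to build the dictionary from the corpus: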

from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary.from_corpus(corpus_vect_gensim,
                                    id2word=dict((idx, word) for word, idx in vect.vocabulary_.items()))

You can then use this dictionary for the tfidf, LSI or LDA models.