Tag: gensim

What is the simplest way to get tf-idf with a pandas DataFrame?

I want to calculate tf-idf from the documents below. I'm using Python and pandas.

import pandas as pd
df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})

First, I thought I would need to get the word count for each row, so I wrote a simple function:

def word_count(sent):
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt: word2cnt[word] += 1
        else: word2cnt[word] = 1
    return word2cnt

Then I applied it to each row.

df['word_count'] = df['sent'].apply(word_count)

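But now I'm lost. I know there is an easy way to calculate tf-idf if I use Graphlab, but I want to stick with an open-source option. Both sklearn and gensim look overwhelming. What is the simplest solution to get tf-idf?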
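One way to continue from the per-row word counts, sketched with only the standard library. It assumes the plain weighting tf × log(N / df); the names `tf_idf` and `scores` are made up for this illustration, and the sentences are the ones from the question. With a DataFrame, the input list could come from `df['sent'].tolist()`.

```python
import math
from collections import Counter

sentences = [
    "This is the first sentence",
    "This is the second sentence",
    "This is the third sentence",
]

def word_count(sent):
    """Return a mapping from each word to its count in one sentence."""
    return Counter(sent.split())

def tf_idf(docs):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    counts = [word_count(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents in which each word occurs.
    doc_freq = Counter(word for c in counts for word in c)
    return [
        {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in c.items()}
        for c in counts
    ]

scores = tf_idf(sentences)
```

Words that occur in every sentence get a score of 0 under this weighting. Library implementations such as scikit-learn's TfidfVectorizer apply a smoothed idf and L2 normalization by default, so their numbers will differ from this sketch.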

python tf-idf pandas gensim scikit-learn

use*_*952

20
Votes

2
Answers

20k
Views

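How to print the LDA topic models from gensim? Python

Using gensim I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA model?

Printing lda.print_topics(10) gives the following error, because print_topics() returns a NoneType: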

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

Code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of …

python nlp lda gensim topic-modeling

alv*_*vas

19
Votes

5
Answers

30k
Views

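Using scikit-learn vectorizers and vocabularies with gensim

I am trying to reuse scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first, I already have a great deal of vectorized data; second, I prefer the interface and flexibility of scikit-learn vectorizers; third, even though topic modeling with gensim is very fast, computing its dictionaries (Dictionary()) is relatively slow in my experience.

Similar questions have been asked before, especially here and here, and the bridging solution is gensim's Sparse2Corpus() function, which transforms a SciPy sparse matrix into a gensim corpus object.

However, this conversion does not make use of the vocabulary_ attribute of sklearn vectorizers, which holds the mapping between words and feature ids. This mapping is necessary in order to print the discriminant words for each topic (id2word in gensim topic models, described as "a mapping from word ids (integers) to words (strings)").

I am aware that gensim's Dictionary objects are much more complex (and slower to compute) than scikit's vect.vocabulary_ (a simple Python dict)...

Any ideas on how to use vect.vocabulary_ as id2word in gensim models?

Some example code: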

# our data
documents = [u'Human machine interface for lab abc computer applications',
        u'A survey of user opinion of computer system response time',
        u'The EPS user interface management system',
        u'System and human system engineering testing of EPS',
        u'Relation of user perceived response time to error measurement',
        u'The generation of random …
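Since vocabulary_ maps words to feature ids and id2word is described as the reverse mapping, one possible bridge is simply to invert the dict. The sketch below uses a small hand-written dict as a stand-in for vect.vocabulary_ (its contents are made up), and it is an assumption that the topic model accepts a plain Python dict for id2word.

```python
# Hypothetical stand-in for vect.vocabulary_: maps word -> feature id.
vocabulary = {"human": 2, "computer": 0, "interface": 3, "system": 4, "eps": 1}

# Invert the mapping so that it goes from feature id -> word.
id2word = {feature_id: word for word, feature_id in vocabulary.items()}
```

The inversion is a single pass over the vocabulary, so it avoids building a gensim Dictionary from the raw texts a second time.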

python gensim topic-modeling scikit-learn

emi*_*ara

2017 05-23

19
Votes

3
Answers

6780
Views

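Get the most similar words, given the vector of the word (not the word itself)

Using the gensim.models.Word2Vec library, you can provide a model and a "word" for which you want to find the list of most similar words: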

model = gensim.models.Word2Vec.load_word2vec_format(model_file, binary=True)
model.most_similar(positive=[WORD], topn=N)

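I am wondering whether it is possible to give the system the model and a "vector" as input, and ask it to return the top similar words (those whose vectors are very close to the given vector). Something like: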

model.most_similar(positive=[VECTOR], topn=N)

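I need this functionality for a bilingual setting, in which I have two models (English and German) and some English words for which I need to find the most similar German candidates. What I want to do is get the vector of each English word from the English model: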

model_EN = gensim.models.Word2Vec.load_word2vec_format(model_file_EN, binary=True)
vector_w_en=model_EN[WORD_EN]

and then query the German model with these vectors:

model_DE = gensim.models.Word2Vec.load_word2vec_format(model_file_DE, binary=True)
model_DE.most_similar(positive=[vector_w_en], topn=N)

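I have already implemented this in C, using the original distance function in the word2vec package. But now I need it in Python, so that I can integrate it with my other scripts.

Do you know whether there is already a method in the gensim.models.Word2Vec library, or in other similar libraries, that does this? Do I need to implement it myself?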
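If it did have to be implemented by hand, ranking by cosine similarity against a raw vector could look roughly like the sketch below. The dict `vectors_de` and its three-dimensional vectors are invented stand-ins for a German model, and the function name `most_similar_to_vector` is hypothetical. The sketch also assumes the query vector lives in the same space as the stored vectors, which is not automatically true for two separately trained models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_to_vector(vectors, query, topn=2):
    """Return the topn (word, similarity) pairs closest to the query vector."""
    scored = [(word, cosine(vec, query)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy stand-in for the German model's word vectors (made-up values).
vectors_de = {
    "hund": [0.9, 0.1, 0.0],
    "katze": [0.7, 0.6, 0.1],
    "auto": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # would be the English word's vector
result = most_similar_to_vector(vectors_de, query, topn=2)
```

More recent gensim releases appear to offer this directly through a similar_by_vector method on the word-vector object, which would be worth checking before writing custom code.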

python gensim word2vec

ami*_*min

2016 12-17

19
Votes

1
Answers

20k
Views

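Load precomputed vectors in Gensim

I am using the Gensim Python package to learn a neural language model, and I know that you can provide a training corpus to learn the model. However, there already exist many precomputed word vectors available in text format (e.g. http://www-nlp.stanford.edu/projects/glove/). Is there a way to initialize a Gensim Word2Vec model that just uses some precomputed vectors, rather than learning the vectors from scratch?

Thanks!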
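For context, reading vectors from a text file is straightforward on its own. The sketch below assumes the common layout of one word per line followed by its space-separated components; the two sample lines are made up, and `load_text_vectors` is a hypothetical helper, not a gensim function.

```python
import io

def load_text_vectors(handle):
    """Parse lines of the form 'word v1 v2 ... vn' into {word: [floats]}."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Made-up sample standing in for a downloaded vector file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_text_vectors(sample)
```

The word2vec text format that gensim reads additionally starts with a header line giving the vocabulary size and the vector dimension, so a file without that header would probably need one prepended before gensim can load it.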

python nlp gensim word2vec

MEr*_*ric

18
Votes

2
Answers

10k
Views

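What meaning does the length of a Word2vec vector have?

I am using Word2vec through gensim with Google's pretrained vectors trained on Google News. I have noticed that the word vectors I can access by doing direct index lookups on the Word2Vec object are not unit vectors: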

>>> import numpy
>>> from gensim.models import Word2Vec
>>> w2v = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
>>> king_vector = w2v['king']
>>> numpy.linalg.norm(king_vector)
2.9022589

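However, in the most_similar method, these non-unit vectors are not used; instead, normalized versions are taken from the undocumented .syn0norm property, which contains only unit vectors: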

>>> w2v.init_sims()
>>> unit_king_vector = w2v.syn0norm[w2v.vocab['king'].index]
>>> numpy.linalg.norm(unit_king_vector)
0.99999994

The larger vector is just a scaled-up version of the unit vector:

>>> king_vector - numpy.linalg.norm(king_vector) * unit_king_vector
array([  0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
         0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
        -7.45058060e-09,   0.00000000e+00,   3.72529030e-09,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        ... (some lines omitted) ...
        -1.86264515e-09,  -3.72529030e-09,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00, …
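The relationship checked above (vector minus its length times its unit vector is approximately zero) can be reproduced on a small example without the model. The three-component vector below is made up for illustration and has nothing to do with the actual 'king' vector.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

vector = [3.0, 4.0, 12.0]            # made-up example vector
length = norm(vector)                # Euclidean length of the vector
unit = [x / length for x in vector]  # same direction, length 1

# Rescaling the unit vector by the original length recovers the vector,
# up to floating-point rounding.
residual = [x - length * u for x, u in zip(vector, unit)]
```

Normalizing discards only the length, so whatever information the length carries is exactly what most_similar ignores when it works on unit vectors.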

python nlp gensim word2vec

Mar*_*ery

18
Votes

1
Answers

6132
Views

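Matching words and vectors in a gensim Word2Vec model

I have had the gensim Word2Vec implementation compute some word embeddings for me. As far as I can tell everything went very well; now I am clustering the resulting word vectors, hoping to get some semantic groupings.

As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. That is, if I have the embedding vector [x, y, z], I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab, and the word vectors through model.syn0. But I could not find a place where these are explicitly matched.

This was more complicated than I expected, and I feel I might be missing the obvious way of doing it. Any help is appreciated!

Problem:

Match words to the embedding vectors created by Word2Vec() - how do I do it?

My approach:

After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix output as model.syn0. Thus: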

for i in range (0, newmod.syn0.shape[0]): #iterate over all words in model
    print i
    word= [k for k in newmod.vocab if newmod.vocab[k].__dict__['index']==i] #get the word out of the internal dictionary by its index
    wordvector= newmod.syn0[i] #get the vector with the corresponding index
    print wordvector == newmod[word] #testing: compare result of looking up …
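The loop above scans the whole vocabulary once per row, which grows quadratically with vocabulary size. A reverse lookup can be built in a single pass instead. The sketch below imitates the structures from the question with toy data: `vocab` stands in for model.vocab (entries carrying an index attribute) and `syn0` for the vector matrix; all values are invented.

```python
from types import SimpleNamespace

# Toy stand-ins for model.vocab and model.syn0 (made-up contents).
vocab = {
    "tree": SimpleNamespace(index=1),
    "graph": SimpleNamespace(index=0),
    "minor": SimpleNamespace(index=2),
}
syn0 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Build the index -> word lookup once, in a single pass over the vocabulary.
index2word = [None] * len(vocab)
for word, entry in vocab.items():
    index2word[entry.index] = word

# Row i of the matrix belongs to the word stored at position i.
word_vectors = {word: syn0[i] for i, word in enumerate(index2word)}
```

gensim models of that period are believed to keep a ready-made list of this kind (index2word), which would make even the single pass unnecessary if it is available.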

python vector machine-learning gensim word2vec

pat*_*ick

2017 05-23

18
Votes

3
Answers

10k
Views

Interpreting the sum of TF-IDF scores of words across documents

First, let's extract the TF-IDF scores per term per document:

from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]
stoplist = …

python statistics nlp tf-idf gensim

alv*_*vas

18
Votes

2
Answers

5898
Views

Document topic distribution in Gensim LDA

I've derived an LDA topic model using a toy corpus as follows:

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [[word for word in document.lower().split()] for …

python lda gensim

Mos*_* Xu

17
Votes

2
Answers

10k
Views

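Python: gensim: RuntimeError: you must first build vocabulary before training the model

I know that this question has been asked already, but I was still not able to find a solution for it.

I would like to use gensim's word2vec on a custom data set, but I am still figuring out what format the data set has to be in. I had a look at this post, where the input is basically a list of lists (one big list containing other lists, which are tokenized sentences from the NLTK Brown corpus). So I thought that this is the input format I have to use for the command word2vec.Word2Vec(). However, it won't work with my little test set, and I don't understand why.

What I have tried:

This worked: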

from gensim.models import word2vec
from nltk.corpus import brown
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

brown_vecs = word2vec.Word2Vec(brown.sents())

This didn't work:

sentences = [ "the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep"]
vocab = [s.encode('utf-8').split() for s in sentences]
voc_vec = word2vec.Word2Vec(vocab)

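I don't understand why it doesn't work with the "mock" data, even though it has the same data structure as the sentences from the Brown corpus:

vocab: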

[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', …
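One plausible cause, assuming Word2Vec's default min_count of 5 applies here: words seen fewer than five times are dropped before training, and in a two-sentence corpus that removes every word. The sketch below only imitates that frequency filter with a Counter; `surviving_vocab` is a hypothetical helper, not part of gensim.

```python
from collections import Counter

sentences = ["the quick brown fox jumps over the lazy dogs",
             "yoyoyo you go home now to sleep"]
tokenized = [s.split() for s in sentences]

def surviving_vocab(token_lists, min_count):
    """Return the words that occur at least min_count times overall."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return sorted(w for w, c in counts.items() if c >= min_count)

kept_default = surviving_vocab(tokenized, min_count=5)  # nothing survives
kept_all = surviving_vocab(tokenized, min_count=1)      # every word survives
```

If this is indeed the cause, passing min_count=1 to the Word2Vec constructor for the small test set would be the thing to try. The Brown corpus is large enough that many words pass the default threshold, which would explain why that case worked.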

python gensim word2vec

use*_*591

17
Votes

2
Answers

20k
Views

Tag statistics

gensim ×10

python ×10

word2vec ×5

nlp ×4

lda ×2

scikit-learn ×2

tf-idf ×2

topic-modeling ×2

machine-learning ×1

pandas ×1

statistics ×1

vector ×1
