Updating a gensim word2vec model

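I have a word2vec model in gensim trained on 98892 documents. For any given sentence that is not in the sentences array (i.e. the set I trained the model on), I need to update the model with that sentence so that querying it next time returns some results. This is how I am doing it: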

new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)

and it prints this in the log:

2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s

Now, when I query most_similar with the same new_sentence (as model.most_similar(positive=new_sentence)), it raises an error:

Traceback (most recent call last):
  File "<pyshell#220>", line 1, in <module>
    model.most_similar(positive=['moscow', 'weather', 'cold'])
  File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
    raise KeyError("word '%s' not …
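
A note on the call above: gensim's `Word2Vec.train` expects an iterable of sentences, each of which is itself a list of tokens. Handed a flat list of strings, each word is read as a "sentence" and iterated character by character, which may be part of what goes wrong here. The sketch below uses only plain Python to show that misreading next to the nested shape. `as_corpus` is a hypothetical helper name, not part of gensim, and fixing the shape does not by itself add new words to the vocabulary.

```python
def as_corpus(sentence_or_corpus):
    """Return a list of token lists.

    Hypothetical helper (not a gensim function): a single tokenized
    sentence is wrapped so that it reads as a one-sentence corpus.
    """
    if sentence_or_corpus and isinstance(sentence_or_corpus[0], str):
        # A flat list of tokens: treat it as exactly one sentence.
        return [list(sentence_or_corpus)]
    # Already a list of sentences: copy each one.
    return [list(sentence) for sentence in sentence_or_corpus]


new_sentence = ['moscow', 'weather', 'cold']

# What a corpus consumer sees when handed the flat token list:
# every item is a string, and iterating a string yields characters.
misread = [list(item) for item in new_sentence]
print(misread[0])  # ['m', 'o', 's', 'c', 'o', 'w']

# The nested shape: one sentence containing three tokens.
corpus = as_corpus(new_sentence)
print(corpus)      # [['moscow', 'weather', 'cold']]
```

Wrapping the sentence only fixes the input shape. Words that were not in the vocabulary when it was built are still skipped during training. Newer gensim releases accept `update=True` in `build_vocab` to extend the vocabulary before calling `train`, so it is worth checking that against the installed version.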

gensim word2vec

use*_*542

lucky-day

29 votes, 3 answers, 20k views

Is it possible to retrain a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in Python?

I am using the pre-trained Google News dataset to get word vectors, via the Gensim library in Python:

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

After loading the model, I convert the words of the training review sentences into vectors:

# reading all sentences from the training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()
# cleaning sentences
x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])

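During the word2vec step I get a lot of errors for words in my corpus that are not in the model. The question is: how can I retrain the already pre-trained model (e.g. GoogleNews-vectors-negative300.bin) to get word vectors for those missing words?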
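
Separately from retraining, the lookup errors themselves can be avoided by skipping words the model does not know when building a sentence vector. The sketch below is a minimal illustration under stated assumptions: the question does not show `buildWordVector`, so it is assumed here to average per-word vectors, and `build_word_vector` and `toy_vectors` are hypothetical stand-ins. A plain dict plays the role of the model, relying only on `word in vectors` and `vectors[word]`.

```python
def build_word_vector(tokens, vectors, size):
    """Average the vectors of known tokens and report the unknown ones.

    Hypothetical stand-in for the question's buildWordVector, whose
    definition is not shown. `vectors` is any mapping from word to a
    sequence of `size` floats.
    """
    total = [0.0] * size
    count = 0
    missing = []
    for word in tokens:
        if word not in vectors:
            # Out-of-vocabulary word: record it instead of raising KeyError.
            missing.append(word)
            continue
        total = [t + v for t, v in zip(total, vectors[word])]
        count += 1
    if count:
        total = [t / count for t in total]
    return total, missing


# Toy two-dimensional vectors, made up purely for illustration.
toy_vectors = {'good': [1.0, 0.0], 'food': [0.0, 1.0]}

vec, missing = build_word_vector(['good', 'food', 'zzyzx'], toy_vectors, 2)
print(vec)      # [0.5, 0.5]
print(missing)  # ['zzyzx']
```

Collecting the missing words also shows how much of the corpus the pre-trained vectors fail to cover, which helps in deciding whether retraining is worth the effort.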

Here is what I have tried: I trained a new model from the training sentences I had.

# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 10   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3 …

python nlp gensim word2vec

Nom*_*uks

2016-03-14

10 votes, 2 answers, 5972 views