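I have a word2vec model in gensim, trained on 98892 documents. For any given sentence that is not in the sentences array (i.e. the set I trained the model on), I need to update the model with that sentence so that the next query for it returns some results. This is how I am doing it: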
new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)
and it prints this to the log:
2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s
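One thing that may be worth checking about the call above is the shape of the input. As far as I know, gensim's `train()` iterates over a collection of sentences, each of which is itself a list of tokens. The sketch below uses plain Python only (no gensim) to show what that nested iteration would see for a flat list of strings versus a list wrapped in an outer list. The helper name `iter_tokens` is hypothetical and exists only for this illustration.

```python
def iter_tokens(sentences):
    """Flatten an iterable of sentences into tokens, as a nested loop would.

    Hypothetical helper for illustration; it is not part of gensim.
    """
    return [token for sentence in sentences for token in sentence]


new_sentence = ['moscow', 'weather', 'cold']

# A flat list of strings: each string is treated as a "sentence",
# so the "tokens" end up being single characters.
flat = iter_tokens(new_sentence)

# The same words wrapped in an outer list: one sentence of three tokens.
wrapped = iter_tokens([new_sentence])

print(flat[:6])   # the first six items are the letters of 'moscow'
print(wrapped)    # the three original words
```

If that is what happens here, passing `[new_sentence]` instead of `new_sentence` would at least give the model whole words to work with. Whether words outside the original vocabulary can then be learned is a separate question that depends on the gensim version.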
Now, when I query for the most similar words with the same new_sentence (as model.most_similar(positive=new_sentence)), it raises an error:
Traceback (most recent call last):
  File "<pyshell#220>", line 1, in <module>
    model.most_similar(positive=['moscow', 'weather', 'cold'])
  File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
    raise KeyError("word '%s' not …

I am using the pre-trained Google News dataset to get word vectors with the Gensim library in Python:
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
After loading the model, I convert the words of the training review sentences into vectors:
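import numpy as np

# Read all sentences from the training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()

# Clean the sentences (review_to_wordlist is a helper not shown here)
x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]
# Build the vectors for each review and concatenate them (buildWordVector and n_dim are not shown here)
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])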
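During the word2vec step I get a lot of errors for words in my corpus that are not in the model. The question is: how can I retrain an already pre-trained model (e.g. GoogleNews-vectors-negative300.bin) to get word vectors for those missing words?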
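Separately from retraining, the errors themselves can be avoided by skipping words the model does not know when building a sentence vector. The following is a minimal sketch of that idea, not the `buildWordVector` used above (which is not shown). The name `build_word_vector`, the toy dictionary standing in for the loaded model, and the 3-dimensional vectors are all assumptions made for illustration, and plain lists are used so the example runs without numpy or gensim.

```python
def build_word_vector(tokens, vectors, n_dim):
    """Average the vectors of the tokens found in `vectors`, skipping the rest.

    Hypothetical helper: `vectors` is any mapping from word to a list of
    floats, standing in for a loaded word2vec model.
    Returns (average_vector, list_of_missing_tokens).
    """
    total = [0.0] * n_dim
    found = 0
    missing = []
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:
            # Out-of-vocabulary word: record it instead of raising KeyError.
            missing.append(token)
            continue
        total = [t + v for t, v in zip(total, vec)]
        found += 1
    if found:
        total = [t / found for t in total]
    return total, missing


# Toy stand-in for the pre-trained model (assumed values, 3 dimensions).
toy_vectors = {"good": [1.0, 0.0, 2.0], "food": [3.0, 2.0, 0.0]}

avg, missing = build_word_vector(["good", "food", "yummm"], toy_vectors, 3)
print(avg)      # average of the two known vectors
print(missing)  # words that had no vector
```

Collecting the missing words this way also shows how much of the corpus falls outside the pre-trained vocabulary, which helps in judging whether retraining is worth the effort.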
Here is what I have tried: I trained a new model from the training sentences I had.
# Set values for various parameters
num_features = 300 # Word vector dimensionality
min_word_count = 10 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 10 # Context window size
downsampling = 1e-3 …