Tag: word2vec

Word2Vec: effect of the window size used

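I am trying to train a word2vec model on very short phrases (5-grams). Since every sentence or example is so short, I believe the largest window size I can use is 2. I want to understand what effect such a small window size has on the quality of the learned model, so that I can tell whether my model has learned anything meaningful. I tried training a word2vec model on the 5-grams, but the learned model does not seem to capture semantics and the like very well.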
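To make the effect of the window size on 5-grams concrete, here is a minimal sketch that counts the (center, context) pairs a symmetric window produces for one 5-token phrase. The function name `skipgram_pairs` and the example phrase are made up for illustration, and the sketch ignores library-specific details such as subsampling.

```python
def skipgram_pairs(tokens, window):
    """Return every (center, context) pair within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical 5-gram, used only to count pairs.
five_gram = ["the", "quick", "brown", "fox", "jumps"]

for window in (1, 2, 4, 10):
    print(window, len(skipgram_pairs(five_gram, window)))
```

Under these assumptions a window of 2 does not exhaust a 5-token phrase: a window of 4 still adds pairs for the tokens at the edges, and anything beyond 4 adds nothing.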

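I use the following test to evaluate the accuracy of the model: https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt

I trained the model with gensim.Word2Vec; here is a snippet of my accuracy scores (using a window size of 2):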

[{'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
 {'correct': 2, 'incorrect': 453, 'section': 'capital-world'},
 {'correct': 0, 'incorrect': 86, 'section': 'currency'},
 {'correct': 2, 'incorrect': 703, 'section': 'city-in-state'},
 {'correct': 123, 'incorrect': 183, 'section': 'family'},
 {'correct': 21, 'incorrect': 791, 'section': 'gram1-adjective-to-adverb'},
 {'correct': 8, 'incorrect': 544, 'section': 'gram2-opposite'},
 {'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
 {'correct': 67, 'incorrect': 863, 'section': 'gram4-superlative'},
 {'correct': 41, 'incorrect': 951, 'section': 'gram5-present-participle'},
 {'correct': 6, 'incorrect': 1089, 'section': 'gram6-nationality-adjective'},
 {'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
 {'correct': 56, …
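The raw counts are easier to compare as per-section accuracies. The sketch below uses a few of the complete rows from the output above (the truncated last row is left out); the helper name `accuracy` is made up for illustration.

```python
# Rows copied from the accuracy output above.
results = [
    {'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
    {'correct': 0, 'incorrect': 86, 'section': 'currency'},
    {'correct': 123, 'incorrect': 183, 'section': 'family'},
    {'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
    {'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
]

def accuracy(row):
    """Fraction of analogy questions answered correctly in one section."""
    total = row['correct'] + row['incorrect']
    return row['correct'] / total if total else 0.0

by_section = {row['section']: accuracy(row) for row in results}

# Print sections from best to worst.
for section, acc in sorted(by_section.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{section:28s} {acc:6.1%}")
```

Read this way, the family and comparative sections score far higher than the capital and currency sections. That pattern would be consistent with a small window picking up local, syntactic regularities more readily than topical ones, although that is an interpretation rather than something these numbers prove.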

gensim word2vec

vvk*_*itk

10 votes, 2 answers, 10k views

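Word2vec training with gensim starts swapping after 100K sentences

I am trying to train a word2vec model on a file of about 170K lines, with one sentence per line.

I think I may represent a special use case, because the "sentences" consist of arbitrary strings rather than dictionary words. Each sentence (line) has about 100 words, and each "word" is about 20 characters long, including characters like "/" and digits.

The training code is very simple: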

# as shown in http://rare-technologies.com/word2vec-tutorial/
import logging
import os

import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class MySentences(object):
    """Stream tokenized lines from every file in a directory, one line at a time."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # the with-block closes each file once it has been read
            with open(os.path.join(self.dirname, fname)) as infile:
                for line in infile:
                    yield line.split()

current_dir = os.path.dirname(os.path.realpath(__file__))

# each line represents a full chess match
input_dir = current_dir + "/../fen_output"
output_file = current_dir + "/../learned_vectors/output.model.bin"

sentences = MySentences(input_dir)

model = gensim.models.Word2Vec(sentences, workers=8)

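The thing is, everything goes really fast up to 100K sentences (with my RAM usage rising steadily), but then I run out of RAM, I can see that my PC has started swapping, and training grinds to a halt. I don't have much RAM available, only about 4 GB, and word2vec uses all of it before the swapping starts.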
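One way to check whether the vocabulary itself is what fills the RAM is to count unique tokens in a streaming pass and estimate the size of the weight matrices. This is a rough sketch under stated assumptions, not gensim's own accounting: the helper names are made up, and the defaults (`min_count=5`, 100-dimensional float32 vectors, two weight matrices) are what I believe gensim's Word2Vec uses by default.

```python
from collections import Counter

def count_vocab(lines, min_count=5):
    """Return (unique tokens seen, tokens that survive the min_count cutoff)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    kept = sum(1 for c in counts.values() if c >= min_count)
    return len(counts), kept

def weight_bytes(vocab_size, vector_size=100, bytes_per_float=4, n_matrices=2):
    """Approximate size of the embedding matrices, excluding the vocabulary dict."""
    return vocab_size * vector_size * bytes_per_float * n_matrices

# Tiny made-up sample in the spirit of the "/"-and-digit tokens described above.
sample = ["a/1 b/2 a/1", "a/1 c/3"]
print(count_vocab(sample, min_count=2))
print(weight_bytes(1_000_000) / 1e9, "GB for one million kept tokens")
```

With about 170K lines of roughly 100 tokens each there are around 17 million tokens. If most of them are unique 20-character strings, the dictionary of raw counts built while scanning the corpus could plausibly approach the 4 GB available before any training starts.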

I think OpenBLAS is correctly linked to numpy; this is what numpy.show_config() tells me:

blas_info:
  libraries = ['blas']
  library_dirs = ['/usr/lib']
  language = f77 …

python numpy blas gensim word2vec

Fel*_*ida

2015-06-26

10 votes, 1 answer, 694 views

Error when loading a Word2Vec model in gensim

I get an AttributeError when loading the gensim model available in the word2vec repository:

from gensim import models
w = models.Word2Vec()
w.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print w["queen"]

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-8219e36ba1f6> in <module>()
----> 1 w["queen"]

C:\Anaconda64\lib\site-packages\gensim\models\word2vec.pyc in __getitem__(self, word)
    761 
    762         """
--> 763         return self.syn0[self.vocab[word].index]
    764 
    765 

AttributeError: 'Word2Vec' object has no attribute 'syn0'

Is this a known issue?
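A likely explanation, assuming `load_word2vec_format` is a class-level constructor that returns a new model (which matches how another question in this listing calls it, as `model = Word2Vec.load_word2vec_format(...)`), is that the return value is discarded and `w` stays an empty model. The toy class below is a hypothetical stand-in, not gensim code; it only reproduces the pattern.

```python
class ToyVectors:
    """Hypothetical stand-in for a model class; not part of gensim."""

    def __init__(self):
        self.vectors = None  # an empty model has no vectors yet

    @classmethod
    def load_format(cls, data):
        # Builds and RETURNS a new object; it never touches an existing instance.
        obj = cls()
        obj.vectors = dict(data)
        return obj

# Pattern from the question: call on an instance, ignore the return value.
w_bad = ToyVectors()
w_bad.load_format({"queen": [0.1, 0.2]})
print(w_bad.vectors)  # still None

# Keeping the returned object instead.
w_good = ToyVectors.load_format({"queen": [0.1, 0.2]})
print(w_good.vectors["queen"])
```

If that is the cause, assigning the result (`w = models.Word2Vec.load_word2vec_format(...)`) should avoid the error in the gensim version shown in the traceback. Later gensim releases, as far as I know, expose the same loader on `KeyedVectors` instead.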

python gensim word2vec

Tar*_*ula

10 votes, 2 answers, 20k views

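Is it possible to retrain a word2vec model (e.g. GoogleNews-vectors-negative300.bin) on a corpus of sentences in Python?

I am using the pre-trained Google News dataset to obtain word vectors, via the Gensim library in Python: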

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

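After loading the model, I convert the words of the training review sentences into vectors:

import numpy as np

# read all sentences from the training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()
# clean the sentences (review_to_wordlist, buildWordVector and n_dim are defined elsewhere by the asker)
x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])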

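During the word2vec step I get a lot of errors for words in my corpus that are not in the model. The question is how I can retrain the already pre-trained model (e.g. GoogleNews-vectors-negative300.bin) so that I get word vectors for those missing words.

Here is what I tried: I trained a new model on the training sentences I had.

# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 10   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3 …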

python nlp gensim word2vec

Nom*_*uks

2016-03-14

10 votes, 2 answers, 5972 views

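Gensim word2vec with a predefined dictionary and word-index data

I need to train a word2vec representation on tweets using gensim. Unlike most tutorials and code I have seen for gensim, my data is not raw but has already been preprocessed. I have a dictionary in a text document containing 65k words (including an "unknown" token and an EOL token), and the tweets are saved as a numpy matrix of indices into this dictionary. A simple example of the data format is shown below: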

dict.txt

you
love
this
code

Tweets (5 is unknown, 6 is EOL)

[[0, 1, 2, 3, 6],
 [3, 5, 5, 1, 6],
 [0, 1, 3, 6, 6]]

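I am not sure how I should handle the index representation. A simple way is just to convert the list of indices into a list of strings (i.e. [0, 1, 2, 3, 6] -> ['0', '1', '2', '3', '6']) as I read it into the word2vec model. However, this must be inefficient, as gensim will then try to look up the internal index used for, e.g., '2'.

How do I load this data and create the word2vec representation in an efficient manner using gensim?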
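As a point of comparison, the conversion the question calls inefficient can be written as a lazy generator, so only one tweet is expanded at a time. This is a minimal sketch on the toy data above: the function name `rows_to_sentences` and the `"<unk>"` placeholder are made up, and it assumes everything after the first EOL index in a row is padding.

```python
vocab = ["you", "love", "this", "code"]  # contents of dict.txt in the example
UNK, EOL = 5, 6                          # special indices given in the question

tweets = [[0, 1, 2, 3, 6],
          [3, 5, 5, 1, 6],
          [0, 1, 3, 6, 6]]

def rows_to_sentences(rows, vocab, unk=UNK, eol=EOL):
    """Yield one list of string tokens per row, stopping at the first EOL."""
    for row in rows:
        sentence = []
        for idx in row:
            if idx == eol:
                break
            known = idx != unk and 0 <= idx < len(vocab)
            sentence.append(vocab[idx] if known else "<unk>")
        yield sentence

for sentence in rows_to_sentences(tweets, vocab):
    print(sentence)
```

Since gensim's Word2Vec consumes an iterable of token lists, a generator like this (wrapped in a restartable iterable if the corpus has to be read more than once) is one plausible way to feed the data. Whether the string lookup is a real bottleneck would need measuring.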

python nlp gensim word2vec

pir*_*pir

2016-03-13

10 votes, 1 answer, 2574 views

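Implementing word2vec in Keras

I would like to implement the word2vec algorithm in Keras. Is this possible? How should I fit the model? Should I use a custom loss function?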
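On the loss question, one common formulation is skip-gram with negative sampling, where each (center, context) pair is scored by a dot product and classified as real or sampled noise. The sketch below computes that loss for a single example in plain Python rather than Keras; the function names are made up, and it illustrates the objective only, not a full training loop.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one positive pair and k sampled negatives."""
    loss = -log_sigmoid(dot(center, context))   # pull the true pair together
    for neg in negatives:
        loss -= log_sigmoid(-dot(center, neg))  # push sampled pairs apart
    return loss

# Made-up 2-d vectors: the context is aligned with the center, the negative is not.
print(sgns_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]]))
```

Because this is binary cross-entropy applied to a dot-product logit, a standard binary cross-entropy loss on pairs labeled 1 (observed) or 0 (sampled) may be enough, so a custom loss is not necessarily required under this formulation.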

nlp theano word2vec deep-learning keras

And*_*rás

2017-12-21

10 votes, 1 answer, 7979 views

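word2vec: CBOW and skip-gram performance with respect to training dataset size

The question is simple. Which of CBOW and skip-gram works better for a big dataset? (And the answer for a small dataset follows.)

I am confused, since, by Mikolov himself, [Link]

Skip-gram: works well with a small amount of training data, represents well even rare words or phrases.

CBOW: several times faster to train than skip-gram, slightly better accuracy for frequent words.

But, according to Google TensorFlow, [Link]

CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets.

However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets. We will focus on the skip-gram model in the rest of this tutorial.

Here is a Quora post that supports the first view [Link], and another Quora post that suggests the second view [Link]; both views seem derivable from the credible sources mentioned above.

Or is it as Mikolov said:

Overall, the best practice is to try a few experiments and see what works best for you, as different applications have different requirements.

But surely there is an empirical or analytical verdict, or a final word, on this matter?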
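The mechanical difference that the TensorFlow quote describes (one observation per context versus one per context-target pair) can be sketched as follows. The function names and the four-token example are made up for illustration; the sketch says nothing about which model performs better on a given dataset size.

```python
def cbow_examples(tokens, window):
    """One example per position: (all context words, target word)."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            yield (context, target)

def skipgram_examples(tokens, window):
    """One example per (target, context word) pair."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield (target, word)

tokens = ["a", "b", "c", "d"]  # made-up sentence
cbow = list(cbow_examples(tokens, window=1))
skip = list(skipgram_examples(tokens, window=1))
print(len(cbow), cbow)
print(len(skip), skip)
```

From the same text, skip-gram produces more, finer-grained training examples, while CBOW pools each context into a single example. That matches both quoted descriptions of the mechanism without settling the empirical question.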

nlp word2vec word-embedding

Sea*_*ean

2016-09-19

9 votes, 1 answer, 2567 views

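Difference between most_similar and similar_by_vector in gensim word2vec?

I am confused by the results of most_similar and similar_by_vector from gensim's Word2VecKeyedVectors. They are supposed to compute cosine similarity in the same way. However:

Running them with a single word gives identical results, for example model.most_similar(['obama']) and similar_by_vector(model['obama']).

But if I give it an equation: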

model.most_similar(positive=['king', 'woman'], negative=['man'])

gives:

[('queen', 0.7515910863876343), ('monarch', 0.6741327047348022), ('princess', 0.6713887453079224), ('kings', 0.6698989868164062), ('kingdom', 0.5971318483352661), ('royal', 0.5921063423156738), ('uncrowned', 0.5911505818367004), ('prince', 0.5909028053283691), ('lady', 0.5904011130332947), ('monarchs', 0.5884358286857605)]

while:

q = model['king'] - model['man'] + model['woman']
model.similar_by_vector(q)

gives:

[('king', 0.8655095100402832), ('queen', 0.7673765420913696), ('monarch', 0.695580005645752), ('kings', 0.6929547786712646), ('princess', 0.6909604668617249), ('woman', 0.6528975963592529), ('lady', 0.6286187767982483), ('prince', 0.6222133636474609), ('kingdom', 0.6208546161651611), ('royal', 0.6090123653411865)]

There is a noticeable difference in the cosine similarity scores of the words queen, monarch, etc. I wonder why?

Thanks!
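A plausible explanation, to my understanding of gensim's behavior, is that most_similar scales each input vector to unit length before combining them and then drops the input words from the results, whereas the hand-built query `q` mixes raw vectors of different lengths. The toy sketch below uses made-up 2-d vectors and plain Python (no gensim) to show that the two ways of building the query yield different cosine similarities.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    n = norm(v)
    return [x / n for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def combine(positive, negative):
    """Sum the positive vectors and subtract the negative ones."""
    total = [0.0] * len(positive[0])
    for v in positive:
        total = [t + x for t, x in zip(total, v)]
    for v in negative:
        total = [t - x for t, x in zip(total, v)]
    return total

# Made-up vectors with deliberately different lengths.
vectors = {"king": [3.0, 3.0], "man": [1.0, 0.0],
           "woman": [0.0, 1.0], "queen": [1.0, 2.0]}

# Raw arithmetic, as in q = model['king'] - model['man'] + model['woman'].
raw_query = combine([vectors["king"], vectors["woman"]], [vectors["man"]])
# Unit-normalize each input first, as most_similar is assumed to do.
unit_query = combine([unit(vectors["king"]), unit(vectors["woman"])],
                     [unit(vectors["man"])])

raw_sim = cosine(raw_query, vectors["queen"])
unit_sim = cosine(unit_query, vectors["queen"])
print(raw_sim, unit_sim)
```

The exclusion of input words would also account for 'king' and 'woman' appearing only in the similar_by_vector output above.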

nlp gensim word2vec

pei*_*aqi

9 votes, 1 answer, 7996 views

Graph of connected sentences

I have a list of sentences on a few (two) topics, as shown below:

Sentences
Trump says that it is useful to win the next presidential election. 
The Prime Minister suggests the name of the winner of the next presidential election.
In yesterday's conference, the Prime Minister said that it is very important to win the next presidential election. 
The Chinese Minister is in London to discuss about climate change.
The president Donald Trump states that he wants to win the presidential election. This will require a strong media engagement.
The president Donald …

python nlp nltk networkx word2vec

sti*_*ing

2020-10-11

9 votes, 1 answer, 392 views