word2vec, sum or average word embeddings?


I am using word2vec to represent a small phrase (3 to 4 words) as a single vector, either by adding each individual word embedding or by computing the average of the word embeddings.

From the experiments I have run, I always get the same cosine similarity for both variants. I suspect it has to do with the word vectors being normalised to unit length (Euclidean norm) after word2vec training? Or perhaps I have a bug in the code, or I am missing something.
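Incidentally, a quick way to test the unit-length suspicion is to inspect the norm of a single word vector directly. A minimal sketch, assuming the same model file used in the code below:

from numpy.linalg import norm
from gensim.models import Word2Vec

# assumes the same binary model path as in the question
word2vec = Word2Vec.load_word2vec_format("/data/word2vec/vectors_200.bin", binary=True)
print norm(word2vec['founder'])  # 1.0 would mean the stored vectors are unit-normalised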

Here is the code:

import numpy as np
from nltk import PunktWordTokenizer
from gensim.models import Word2Vec
from numpy.linalg import norm

def pattern2vector(tokens, word2vec, AVG=False):
    # Build a phrase vector by summing the embeddings of the tokens;
    # if AVG is True, divide the sum by the number of in-vocabulary tokens.
    pattern_vector = np.zeros(word2vec.layer1_size)
    n_words = 0
    if len(tokens) > 1:
        for t in tokens:
            try:
                vector = word2vec[t.strip()]
                pattern_vector = np.add(pattern_vector, vector)
                n_words += 1
            except KeyError:
                # token not in the vocabulary: skip it
                continue
        if AVG is True:
            pattern_vector = np.divide(pattern_vector, n_words)
    elif len(tokens) == 1:
        try:
            pattern_vector = word2vec[tokens[0].strip()]
        except KeyError:
            pass
    return pattern_vector


def main():
    print "Loading word2vec model ...\n"
    word2vecmodelpath = "/data/word2vec/vectors_200.bin"
    word2vec = Word2Vec.load_word2vec_format(word2vecmodelpath, binary=True)
    print "Dimensions", word2vec.layer1_size
    pattern_1 = 'founder and ceo'
    pattern_2 = 'co-founder and former chairman'

    tokens_1 = PunktWordTokenizer().tokenize(pattern_1)
    tokens_2 = PunktWordTokenizer().tokenize(pattern_2)
    print "vec1", tokens_1
    print "vec2", tokens_2

    # SUM: phrase vectors are plain sums of the word embeddings
    p1 = pattern2vector(tokens_1, word2vec, False)
    p2 = pattern2vector(tokens_2, word2vec, False)
    print "\nSUM"
    print "dot(vec1,vec2)", np.dot(p1, p2)
    print "norm(p1)", norm(p1)
    print "norm(p2)", norm(p2)
    print "dot(norm(vec1),norm(vec2))", np.dot(norm(p1), norm(p2))  # product of the two norms
    print "cosine(vec1,vec2)", np.divide(np.dot(p1, p2), np.dot(norm(p1), norm(p2)))
    print "\n"
    # AVG: phrase vectors are averages of the word embeddings
    print "AVG"
    p1 = pattern2vector(tokens_1, word2vec, True)
    p2 = pattern2vector(tokens_2, word2vec, True)
    print "dot(vec1,vec2)", np.dot(p1, p2)
    print "norm(p1)", norm(p1)
    print "norm(p2)", norm(p2)
    print "dot(norm(vec1),norm(vec2))", np.dot(norm(p1), norm(p2))
    print "cosine(vec1,vec2)", np.divide(np.dot(p1, p2), np.dot(norm(p1), norm(p2)))


if __name__ == "__main__":
    main()

Here is the output:

Loading word2vec model ...

Dimensions 200
vec1 ['founder', 'and', 'ceo']
vec2 ['co-founder', 'and', 'former', 'chairman']

SUM
dot(vec1,vec2) 5.4008677771
norm(p1) 2.19382594282
norm(p2) 2.87226958166
dot(norm(vec1),norm(vec2)) 6.30125952303
cosine(vec1,vec2) 0.857109242583


AVG
dot(vec1,vec2) 0.450072314758
norm(p1) 0.731275314273
norm(p2) 0.718067395416
dot(norm(vec1),norm(vec2)) 0.525104960252
cosine(vec1,vec2) 0.857109242583

I am using cosine similarity as defined at Cosine Similarity (Wikipedia). The norms and the dot products do differ between the two variants.
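Plugging the printed values into that definition, cosine(p1, p2) = dot(p1, p2) / (norm(p1) * norm(p2)), both runs reduce to the same ratio:

SUM: 5.4008677771 / 6.30125952303 ≈ 0.857109242583
AVG: 0.450072314758 / 0.525104960252 ≈ 0.857109242583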

Can anyone explain why the cosine is the same?

Thank you, David

Answer by 小智:

Cosine measures the angle between two vectors and does not take the length of either vector into account. When you divide the sum by the phrase length, you are only shortening the vector, not changing its angular position. So your results look correct to me.
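To make the scale-invariance explicit: averaging only multiplies the summed vector by a constant (1/3 for the first phrase, 1/4 for the second), and positive scalars cancel out of the cosine:

cosine(u/m, v/n) = (dot(u, v) / (m*n)) / ((norm(u)/m) * (norm(v)/n)) = dot(u, v) / (norm(u) * norm(v)) = cosine(u, v)

Here is a minimal numpy check with made-up vectors (independent of any word2vec model; cosine_sim is a helper defined here, not part of the question's code):

import numpy as np
from numpy.linalg import norm

def cosine_sim(a, b):
    # cosine similarity: dot product divided by the product of the Euclidean norms
    return np.dot(a, b) / (norm(a) * norm(b))

u = np.array([1.0, 2.0, 3.0])  # stand-in for a 3-word summed phrase vector
v = np.array([4.0, 5.0, 6.0])  # stand-in for a 4-word summed phrase vector

print cosine_sim(u, v)          # 0.974631846...
print cosine_sim(u / 3, v / 4)  # same value: the 1/3 and 1/4 factors cancel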