Vector space model: cosine similarity vs. Euclidean distance

Ant*_*nin 39 trigonometry vector distance euclidean-distance

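I have a corpus of classified texts. From these I create vectors, one per document. The vector components are the word weights in that document, computed as TF-IDF values. Next I build a model in which each class is represented by a single vector, so the model has as many vectors as there are classes in the corpus. Each component of a model vector is the mean of the corresponding component values over all vectors in that class. For unclassified vectors, I determine the similarity to a model vector by computing the cosine between the two vectors.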
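The setup described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the TF-IDF weights, the class labels, and the helper names `cosine_similarity` and `classify` are all hypothetical.

```python
import numpy as np

# Hypothetical TF-IDF weights: one row per document, one column per term.
docs = np.array([
    [0.9, 0.1, 0.0],   # class "sports"
    [0.8, 0.2, 0.1],   # class "sports"
    [0.0, 0.3, 0.9],   # class "politics"
    [0.1, 0.2, 0.8],   # class "politics"
])
labels = np.array(["sports", "sports", "politics", "politics"])

# One model vector per class: the component-wise mean of that class's documents.
classes = np.unique(labels)
centroids = np.array([docs[labels == c].mean(axis=0) for c in classes])

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(vector):
    """Return the class whose model vector has the highest cosine similarity."""
    scores = [cosine_similarity(vector, c) for c in centroids]
    return classes[int(np.argmax(scores))]

unlabelled = np.array([0.7, 0.1, 0.2])  # hypothetical unclassified document
predicted = classify(unlabelled)
print(predicted)
```

Swapping `np.argmax` over cosine scores for `np.argmin` over Euclidean distances would give the variant asked about in question 1.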

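Questions:

1) Can I use the Euclidean distance between the unclassified vector and the model vector to compute their similarity?

2) Why can't Euclidean distance be used as a similarity measure instead of the cosine of the angle between two vectors, and vice versa?

Thanks!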

kiz*_*zx2 39

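An informal but fairly intuitive way to think about this is to consider the two components of a vector: direction and magnitude.

Direction is the "preference" / "style" / "sentiment" / "latent variable" of the vector, while magnitude is how strong it is in that direction.

When classifying documents, we want to group them by their overall sentiment, so we use angular distance.

Euclidean distance is susceptible to documents being clustered by their L2 norm (magnitude, in the two-dimensional case) rather than by direction. That is, vectors with quite different directions would be clustered together because their distances from the origin are similar.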
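A small numeric sketch of the direction-versus-magnitude point, using made-up vectors (the names `short_doc`, `long_doc`, and `other_doc` are hypothetical). It compares a vector with a scaled copy of itself and with a vector of equal magnitude pointing elsewhere.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc              # same direction, ten times the magnitude
other_doc = np.array([0.0, 2.0, 1.0])  # different direction, same magnitude as short_doc

same_dir_cos = cosine_similarity(short_doc, long_doc)
same_dir_dist = euclidean_distance(short_doc, long_doc)
other_cos = cosine_similarity(short_doc, other_doc)
other_dist = euclidean_distance(short_doc, other_doc)

print("scaled copy:   cosine", same_dir_cos, "distance", same_dir_dist)
print("other vector:  cosine", other_cos, "distance", other_dist)
```

With these values, cosine similarity treats the scaled copy as the closer match, while Euclidean distance treats the equal-magnitude vector as closer.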

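  • @xenocyon Consider the case where their magnitudes relative to the origin are small. (3 upvotes)
  • "Vectors with quite different directions [under Euclidean distance] would be clustered because their distances from the origin are similar" -> How so? As an extreme case, consider two exactly opposite vectors of exactly the same magnitude: even though they are the same distance from the origin, the Euclidean distance between them is large. (2 upvotes)
  • If you have three documents with sentiments -1, 1 and 100, which two are closer: the first two or the last two? I think that can only be answered once you know the specific problem you are trying to solve. (2 upvotes)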

Tys*_*son 23

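I'll answer the questions in reverse order. For your second question, cosine similarity and Euclidean distance are two different ways of measuring vector similarity. The former measures the similarity of vectors with respect to the origin, while the latter measures the distance between particular points of interest along the vectors. You can use either one on its own, combine them and use both, or look at one of the many other ways of determining similarity. For more details, see these slides from a Michael Collins lecture.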
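The two measures are linked by the law of cosines, which may help when choosing between them. The symbols $a$, $b$ and $\theta$ (the angle between the two vectors) are introduced here for illustration:

```latex
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,\|a\|\,\|b\|\cos\theta
\qquad\Longrightarrow\qquad
\|a - b\|^2 = 2\,(1 - \cos\theta) \quad \text{when } \|a\| = \|b\| = 1
```

So on L2-normalized vectors, ranking by Euclidean distance and ranking by cosine similarity give the same order. The two can disagree only when the magnitudes differ.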

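Your first question is not very clear, but either measure can be used regardless of whether you are comparing documents or your "models" (which would traditionally be described as clusters, where the model is the sum of all clusters).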


alv*_*vas 5

In terms of computation time (in Python):

import time
import numpy as np

for run in range(10):
    # Cosine similarity of two random 100-dimensional vectors
    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print('Cosine similarity took', time.time() - start)

    # 2 * (1 - cosine similarity): this equals the squared Euclidean distance
    # between the unit-normalized vectors, not the distance between a and b
    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        2 * (1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print('Euclidean from 2*(1 - cosine_similarity) took', time.time() - start)

    # Euclidean distance via the norm of the difference
    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        np.linalg.norm(a - b)
    print('Euclidean Distance using np.linalg.norm() took', time.time() - start)

    # Euclidean distance written out explicitly
    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        np.sqrt(np.sum((a - b) ** 2))
    print('Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took', time.time() - start)
    print('--------------------------------------------------------')

[OUT]:

Cosine similarity took 0.15826010704
Euclidean from 2*(1 - cosine_similarity) took 0.179041862488
Euclidean Distance using np.linalg.norm() took 0.10684299469
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.113723039627
--------------------------------------------------------
Cosine similarity took 0.161732912064
Euclidean from 2*(1 - cosine_similarity) took 0.178358793259
Euclidean Distance using np.linalg.norm() took 0.107393980026
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111194849014
--------------------------------------------------------
Cosine similarity took 0.16274189949
Euclidean from 2*(1 - cosine_similarity) took 0.178978919983
Euclidean Distance using np.linalg.norm() took 0.106336116791
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111373186111
--------------------------------------------------------
Cosine similarity took 0.161939144135
Euclidean from 2*(1 - cosine_similarity) took 0.177414178848
Euclidean Distance using np.linalg.norm() took 0.106301784515
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.11181807518
--------------------------------------------------------
Cosine similarity took 0.162333965302
Euclidean from 2*(1 - cosine_similarity) took 0.177582979202
Euclidean Distance using np.linalg.norm() took 0.105742931366
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111120939255
--------------------------------------------------------
Cosine similarity took 0.16153883934
Euclidean from 2*(1 - cosine_similarity) took 0.176836967468
Euclidean Distance using np.linalg.norm() took 0.106392860413
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.110891103745
--------------------------------------------------------
Cosine similarity took 0.16018986702
Euclidean from 2*(1 - cosine_similarity) took 0.177738189697
Euclidean Distance using np.linalg.norm() took 0.105060100555
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.110497951508
--------------------------------------------------------
Cosine similarity took 0.159607887268
Euclidean from 2*(1 - cosine_similarity) took 0.178565979004
Euclidean Distance using np.linalg.norm() took 0.106383085251
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.11084485054
--------------------------------------------------------
Cosine similarity took 0.161075115204
Euclidean from 2*(1 - cosine_similarity) took 0.177822828293
Euclidean Distance using np.linalg.norm() took 0.106630086899
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.110257148743
--------------------------------------------------------
Cosine similarity took 0.161051988602
Euclidean from 2*(1 - cosine_similarity) took 0.181928873062
Euclidean Distance using np.linalg.norm() took 0.106360197067
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111301898956
--------------------------------------------------------

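  • Judging from these results, there seems to be no significant difference in computation time, so computation time cannot guide the choice of method. (2 upvotes)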
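One possible variant of the benchmark, sketched with the standard-library `timeit` module and with the vectors generated once up front, so that random-number generation is not part of what is measured. The function names and the fixed seed are assumptions chosen for illustration. Timings depend on the machine, so none are shown.

```python
import timeit
import numpy as np

# Generate the vectors once, outside the timed code.
rng = np.random.default_rng(0)
a, b = rng.random(100), rng.random(100)

def cosine_similarity():
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_norm():
    return np.linalg.norm(a - b)

def euclidean_manual():
    return np.sqrt(np.sum((a - b) ** 2))

# Time 10000 calls of each function, matching the loop count used above.
timings = {
    f.__name__: timeit.timeit(f, number=10000)
    for f in (cosine_similarity, euclidean_norm, euclidean_manual)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 10000 calls")
```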