spaCy: strange similarity between two sentences

Question

I have downloaded the en_core_web_lg model and am trying to find the similarity between two sentences:

import spacy

nlp = spacy.load('en_core_web_lg')

search_doc = nlp("This was very strange argument between american and british person")
main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))

It returns a very strange value:

0.9066019751888448

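These two sentences should not be 90% similar; they have very different meanings.

Why is this happening? Do I need to add some kind of additional vocabulary to make the similarity result more reasonable?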

Answer 1

Joh*_*ter 10

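spaCy constructs a sentence embedding by averaging the word embeddings. An ordinary sentence contains many words that carry little meaning (called stop words), so the averaged vectors end up looking alike and the similarity score is poor. You can remove the stop words like this: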

search_doc = nlp("This was very strange argument between american and british person")
main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

search_doc_no_stop_words = nlp(' '.join([str(t) for t in search_doc if not t.is_stop]))
main_doc_no_stop_words = nlp(' '.join([str(t) for t in main_doc if not t.is_stop]))

print(search_doc_no_stop_words.similarity(main_doc_no_stop_words))

Or keep only the nouns, since they carry the most information:

doc_nouns = nlp(' '.join([str(t) for t in doc if t.pos_ in ['NOUN', 'PROPN']]))
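
To see why averaging inflates the score, here is a minimal pure-Python sketch. The 3-dimensional vectors in `TOY` and the `STOP` set are invented for illustration (they are not spaCy's real embeddings or stop-word list); the only assumption carried over from the answer is that a sentence vector is the mean of its word vectors.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings: the stop words all point roughly the same way,
# while the two content words point in nearly orthogonal directions.
TOY = {
    "he": [1.0, 0.1, 0.0],
    "was": [1.0, 0.0, 0.1],
    "this": [1.0, 0.1, 0.1],
    "very": [1.0, 0.0, 0.0],
    "argument": [0.1, 1.0, 0.0],
    "school": [0.1, 0.0, 1.0],
}
STOP = {"he", "was", "this", "very"}

def sentence_vector(words, drop_stop=False):
    """Average the toy vectors, optionally skipping stop words."""
    kept = [w for w in words if not (drop_stop and w in STOP)]
    return mean_vector([TOY[w] for w in kept])

a = ["this", "was", "very", "argument"]
b = ["he", "was", "very", "school"]

with_stop = cosine(sentence_vector(a), sentence_vector(b))
without_stop = cosine(sentence_vector(a, True), sentence_vector(b, True))
print(with_stop, without_stop)
```

With these made-up numbers the shared stop-word direction dominates both averages, so the first score is high even though the content words are unrelated; dropping the stop words exposes the difference.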

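Reading this answer and the others cleared up my misconception that stop words are removed automatically when computing document similarity. This particular answer is great because it focuses on the actual content while reducing noise words and making the similarity computation faster. (3 upvotes)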

Answer 2

den*_*ger 8

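The spaCy documentation on vector similarity explains the basic idea:
each word has a vector representation, learned from contextual embeddings (Word2Vec) that are trained on a corpus, as explained in the documentation.

The embedding of a full sentence is then simply the average over all of its words. If many of the words lie in the same semantic region (for example, filler words such as "he", "was", "this", ...) and the additional vocabulary "cancels out", you can end up with a similarity like the one in your case.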

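The question is, rightly, what you can do about it. From my perspective, you could come up with a more complex similarity measure. Since search_doc and main_doc carry additional information, such as the original sentence, you could modify the vectors with a length-difference penalty, or try comparing shorter pieces of the sentences and computing pairwise similarities (the question then being which parts to compare).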
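
The two ideas above could be sketched roughly as follows. This is only one possible reading of the answer, not the answerer's code: the function names `length_penalized` and `best_pairwise` and the specific formulas (ratio of token counts; best match per part, then averaged) are assumptions made for illustration. The functions work on plain lists of floats, so with spaCy you would pass in vectors such as `doc.vector` or `span.vector`.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def length_penalized(similarity, len_a, len_b):
    """Scale a similarity score by the ratio of the shorter to the longer length."""
    return similarity * min(len_a, len_b) / max(len_a, len_b)

def best_pairwise(parts_a, parts_b):
    """For each part vector of A, take its best match in B, then average those scores."""
    best = [max(cosine(p, q) for q in parts_b) for p in parts_a]
    return sum(best) / len(best)

# Illustrative numbers only: a raw score of 0.9 between a 10-token
# and a 27-token text shrinks to about one third.
penalized = length_penalized(0.9, 10, 27)

# Toy part vectors: A has two parts, only one of which has a match in B.
pairwise = best_pairwise([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
print(penalized, pairwise)
```

How the sentences are split into parts (noun chunks, clauses, sliding windows) is exactly the open question the answer raises, so that choice is deliberately left out of the sketch.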

Sadly, there is currently no clean way to simply resolve this issue.

Answer 3

Mar*_*sio 8

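As others have pointed out, you may want to use Universal Sentence Encoder or InferSent.

For Universal Sentence Encoder, you can install pre-built spaCy models that manage the wrapping of TFHub, so you only need to install the package with pip and the vectors and similarity will work as expected.

You can follow the instructions in this repository (I am the author): https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub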

Install the model: pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub/releases/download/en_use_md-0.2.0/en_use_md-0.2.0.tar.gz#en_use_md-0.2.0
Load and use the model:

import spacy
# this loads the wrapper
nlp = spacy.load('en_use_md')

# your sentences
search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))
# this will print 0.310783598221594

Please disclose that you are the author of the package mentioned (although it is pretty obvious). (2 upvotes)

Answer 4

Dia*_*Kap 6

The Universal Sentence Encoder is now available on the official spaCy site: https://spacy.io/universe/project/spacy-universal-sentence-encoder

1. Installation:

pip install spacy-universal-sentence-encoder

2. Code example:

import spacy_universal_sentence_encoder
# load one of the models: ['en_use_md', 'en_use_lg', 'xx_use_md', 'xx_use_lg']
nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')
# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))

Answer 5

小智 5

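As @dennlinger pointed out, spaCy's sentence embeddings are just the average of the individual word vectors. So if a sentence contains opposing words (for example "good" and "bad"), their vectors may cancel each other out, giving a poor contextual embedding. If your use case is specifically about sentence embeddings, you should try the SOTA approaches below.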

Google's Universal Sentence Encoder: https://tfhub.dev/google/universal-sentence-encoder/2
Facebook's InferSent encoder: https://github.com/facebookresearch/InferSent

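I have tried both of these embeddings. They give good results most of the time, and they use word embeddings as the basis for building sentence embeddings.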
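
A related limitation of plain averaging, not stated in this answer but following directly from how a mean is computed, is that it ignores word order. The sketch below uses invented toy vectors (the `TOY` table is hypothetical, not real spaCy embeddings) to show that two sentences with opposite meanings get the same averaged vector, which is one reason a dedicated sentence encoder can do better.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical toy embeddings, chosen only for illustration.
TOY = {
    "dog": [1.0, 0.0, 0.25],
    "bites": [0.0, 1.0, 0.5],
    "man": [0.25, 0.125, 1.0],
}

def sentence_vector(sentence):
    """Average the toy vectors of the whitespace-separated words."""
    return mean_vector([TOY[w] for w in sentence.split()])

v1 = sentence_vector("dog bites man")
v2 = sentence_vector("man bites dog")

# The mean does not depend on the order of the words being averaged.
same = all(math.isclose(x, y) for x, y in zip(v1, v2))
print(same)
```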

Cheers!
