Semantic similarity between phrases using Gensim

Asked by use*_*157 · score 4 · tags: nltk, python-3.x, gensim

Background

I am trying to determine whether a phrase is semantically related to other words found in a corpus, using Gensim. For example, here is the corpus, with documents pre-tokenized:

 **Corpus**
 Car Insurance
 Car Insurance Coverage
 Auto Insurance
 Best Insurance
 How much is car insurance
 Best auto coverage
 Auto policy
 Car Policy Insurance

My code (based on this gensim tutorial) judges the semantic relatedness of a phrase by computing its cosine similarity against all strings in the corpus.
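Cosine similarity is the measure at the heart of this approach. A minimal, self-contained sketch of it (illustrative only; gensim computes this internally on the LSI vectors):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical directions give 1.0; orthogonal vectors give 0.0.
print(round(cosine([1, 1, 0], [1, 1, 1]), 3))  # 0.816
```

A score of 1 therefore means the two vectors point in exactly the same direction, which is relevant to the problem described below.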

Problem

It seems that if a query contains any term found in my dictionary, the phrase is judged to be semantically similar to the corpus (e.g. **giraffe poop car murderer** gets a cosine similarity of 1, but it should be semantically unrelated). I am not sure how to fix this.

# Tokenize corpus and filter out anything that is a stop word or occurs only once
# (documents and stoplist are assumed to be defined earlier)
from gensim import corpora, models, similarities

texts = [[word for word in document if word not in stoplist]
         for document in documents]
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
        for text in texts]
dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word, converts the
# word to its integer word id and returns the result as a sparse vector

corpus = [dictionary.doc2bow(text) for text in texts]  
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

#convert the query to LSI space
vec_lsi = lsi[vec_bow]              
index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

Answered by 小智 · 7 votes

First, you are not directly comparing the cosine similarity of bag-of-words vectors: you first reduce the dimensionality of your document vectors by applying latent semantic analysis (https://en.wikipedia.org/wiki/Latent_semantic_analysis). This is fine, but I just wanted to emphasise it. It is generally assumed that the underlying semantic space of a corpus has a lower dimensionality than the number of unique tokens. LSA therefore applies a principal component analysis to your vector space and keeps only the directions that contain the most variance (i.e. the directions in which the space changes most rapidly, which are assumed to carry the most information). This is controlled by the num_topics parameter passed to the LsiModel constructor.
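The dimensionality reduction described above can be sketched with a plain truncated SVD, the linear-algebra core of LSA. The toy term-document matrix below is made up for illustration and is not gensim's internal representation:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],  # counts of "car" in each of 4 documents
    [1.0, 1.0, 1.0, 0.0],  # "insurance"
    [0.0, 0.0, 1.0, 1.0],  # "auto"
    [0.0, 0.0, 0.0, 1.0],  # "policy"
])

# Keep only the k singular directions with the largest singular values,
# i.e. the directions of greatest variance (what num_topics selects).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # each document as a k-dim vector

print(doc_vectors.shape)  # (4, 2): 4 documents, 2 latent topics
```

Similarity queries are then performed between these low-dimensional document vectors rather than the raw bag-of-words counts.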

Second, I cleaned up your code a little and embedded the corpus:

# Tokenize corpus and filter out anything that is a
# stop word or occurs only once

from gensim import corpora, models, similarities
from collections import defaultdict

documents = [
    'Car Insurance',  # doc_id 0
    'Car Insurance Coverage',  # doc_id 1
    'Auto Insurance',  # doc_id 2
    'Best Insurance',  # doc_id 3
    'How much is car insurance',  # doc_id 4
    'Best auto coverage',  # doc_id 5
    'Auto policy',  # doc_id 6
    'Car Policy Insurance',  # doc_id 7
]

stoplist = set(['is', 'how'])

texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist]
         for document in documents]

print(texts)
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result
# as a sparse vector

corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

print(sims)

If I run the above, I get the following output:

[(0, 0.97798139), (4, 0.97798139), (7, 0.94720691), (1, 0.89220524), (3, 0.61052465), (2, 0.42138112), (6, -0.1468758), (5, -0.22077486)]

where each entry in this list corresponds to (doc_id, cosine_similarity), ordered by cosine similarity in descending order.

As for your query document: the only one of its words that actually appears in the vocabulary (which is built from the corpus) is car; all other tokens are dropped. Hence the query you pass to the model consists of the singleton document car. Consequently, every document that contains car comes out as highly similar to your input query.
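This silent dropping of out-of-vocabulary tokens can be sketched in a few lines (illustrative; the word ids below are made up, but gensim's Dictionary.doc2bow discards unknown tokens in the same way):

```python
# Hypothetical vocabulary mimicking one built from the corpus above.
vocabulary = {'car': 0, 'insurance': 1, 'coverage': 2,
              'auto': 3, 'best': 4, 'policy': 5}

def to_bow(tokens, vocab):
    """Count only tokens present in the vocabulary; drop everything else."""
    counts = {}
    for tok in tokens:
        if tok in vocab:
            counts[vocab[tok]] = counts.get(vocab[tok], 0) + 1
    return sorted(counts.items())

query = "giraffe poop car murderer".lower().split()
print(to_bow(query, vocabulary))  # [(0, 1)] -- only 'car' survives
```

This is why the nonsense query behaves exactly like the one-word query "car".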

The reason document #3 (Best Insurance) is also ranked highly is that the token insurance frequently co-occurs with car (your query). This is exactly the reasoning behind distributional semantics, i.e. "a word is characterized by the company it keeps" (Firth, J. R. 1957).
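That co-occurrence effect can be made concrete by counting within-document token pairs over the (stop-word filtered) corpus above; this is a simplified sketch of the signal LSA picks up, not what gensim literally computes:

```python
from collections import Counter
from itertools import combinations

# The corpus documents after lower-casing and stop-word removal.
docs = [
    ['car', 'insurance'],
    ['car', 'insurance', 'coverage'],
    ['auto', 'insurance'],
    ['best', 'insurance'],
    ['much', 'car', 'insurance'],
    ['best', 'auto', 'coverage'],
    ['auto', 'policy'],
    ['car', 'policy', 'insurance'],
]

# Count how often each unordered pair of tokens shares a document.
cooc = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc)), 2):
        cooc[(a, b)] += 1

# "car" and "insurance" co-occur more often than any other pair, so LSA
# places them close together in the latent space.
print(cooc.most_common(1))  # [(('car', 'insurance'), 4)]
```

Because car and insurance dominate the co-occurrence counts, a query that reduces to car lands near insurance-heavy documents such as Best Insurance.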