What is a good way to estimate "approximate" semantic similarity between sentences?

Question


Leg*_*end 18 python nlp machine-learning data-mining nltk

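I have been going through the nlp tag on SO for the past few hours and I am fairly confident I did not miss anything, but if I did, please point me to the relevant question.

In the meantime, I will describe what I am trying to do. A common notion I have seen in many posts is that semantic similarity is hard. For instance, in this post, the accepted answer says the following: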

First of all, neither from the perspective of computational 
linguistics nor of theoretical linguistics is it clear what 
the term 'semantic similarity' means exactly. .... 
Consider these examples:

1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.

Which of the sentences 2-4 are similar to 1? 2 is the exact 
opposite of 1, still it is about Pete and Rob (not) finding a 
dog.


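My high-level requirement is to use k-means clustering to categorize text based on semantic similarity, so all I need to know is whether two sentences are an approximate match. For instance, in the example above, I am fine with putting sentences 1, 2, 4 and 5 into one category and sentence 3 into another (of course, 3 would be backed up by some more sentences similar to it). It is a bit like finding related articles, where they do not have to be 100% related.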
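To make the clustering step concrete, here is a minimal sketch of plain (Lloyd's) k-means in pure Python. It assumes each sentence has already been turned into a numeric vector; the 2-D vectors in `toy_vectors` are invented purely for illustration and are not derived from the sentences above. The farthest-first initialisation is an assumption made to keep the result deterministic, not part of the question.

```python
import math

def init_centroids(points, k):
    """Deterministic farthest-first seeding: start at the first point,
    then repeatedly add the point farthest from all chosen centroids."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical 2-D "fingerprints" for sentences 1..5 (made up for the demo):
# four vectors point roughly the same way, the third one does not.
toy_vectors = [(1.0, 0.1), (0.9, 0.2), (0.0, 1.0), (0.8, 0.1), (1.0, 0.0)]
labels = kmeans(toy_vectors, k=2)
print(labels)
```

With these made-up vectors, sentences 1, 2, 4 and 5 end up sharing a label and sentence 3 gets its own. Real sentence vectors are much higher-dimensional, but the loop is the same.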

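I think I ultimately need to construct a vector representation of each sentence, something like a fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, something from WordNet, just the individual stemmed words, or something else altogether?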
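As one possible answer to "what goes in the vector", here is a minimal sketch that uses raw word counts (a bag of words) and compares sentences by cosine similarity. This is only the simplest of the options listed above: there is no stemming, no n-grams and no WordNet lookup, and the regex tokenizer is an assumption chosen to keep the example free of dependencies.

```python
import math
import re
from collections import Counter

SENTENCES = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

def bag_of_words(sentence):
    """Lowercased word counts; a crude stand-in for stems or n-grams."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = [bag_of_words(s) for s in SENTENCES]
similarities = [cosine(vectors[0], v) for v in vectors[1:]]
for number, sim in enumerate(similarities, start=2):
    print(f"sim(1, {number}) = {sim:.2f}")
```

On these five sentences, sentence 3 scores lowest against sentence 1, which matches the grouping wanted here. Sentence 2 scores highest even though it negates sentence 1, which is exactly the caveat raised in the quoted answer: word overlap captures topic, not meaning.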

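This thread does a fantastic job of enumerating all the related techniques, but unfortunately it stops right where the post gets to what I want. Any suggestions on the current state of the art in this area?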

Answer 1

si2*_*19e 5

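Latent semantic modeling could be useful. It is basically just another application of the singular value decomposition. SVDLIBC is a pretty nice C implementation of the approach; it is an oldie but a goodie, and there is even a Python binding for it in the form of sparsesvd.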
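To illustrate the idea, here is a minimal sketch of latent semantic analysis on the five example sentences. It deliberately swaps in NumPy's dense `numpy.linalg.svd` for SVDLIBC/sparsesvd so the example stays small; for a real corpus a sparse, truncated SVD like the ones named above would be the more sensible choice. The tokenizer, the raw-count weighting and the choice of `rank=2` are all assumptions made for the demo.

```python
import re
import numpy as np

SENTENCES = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

def term_sentence_matrix(sentences):
    """Rows are vocabulary words, columns are sentences, cells are counts."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(sentences)))
    for col, words in enumerate(tokenized):
        for w in words:
            matrix[index[w], col] += 1.0
    return matrix

def lsa_embed(matrix, rank):
    """Keep the top `rank` singular directions; returns one row per sentence."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:rank]) @ vt[:rank]).T

counts = term_sentence_matrix(SENTENCES)
embedded = lsa_embed(counts, rank=2)
print(embedded.shape)  # one low-dimensional vector per sentence
```

The rows of `embedded` are dense, low-dimensional sentence vectors that can be fed to k-means or compared with cosine similarity in place of the raw word counts.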
