Python NLTK returns odd results for WordNet similarity measures

Question


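I'm trying to find the similarity between two words using WordNet in Python NLTK. The two example keywords are "game" and "leonardo". First I fetch all synsets of both words, then I cross-match every pair of synsets to find their similarity. Here is my code: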

from nltk.corpus import wordnet as wn

xx = wn.synsets("game")
yy = wn.synsets("leonardo")
for x in xx:
    for y in yy:
        # name() and definition() are methods in NLTK 3 (they were attributes in NLTK 2)
        print(x.name())
        print(x.definition())
        print(y.name())
        print(y.definition())
        print(x.wup_similarity(y))
        print('\n')


Here is the full output:

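game.n.01
a contest with rules to determine a winner
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.285714285714

game.n.02
a single play of a sport or other contest
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.285714285714

game.n.03
an amusement or pastime
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.25

game.n.04
animal hunted for food or sport
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.923076923077

game.n.05
(tennis) a division of play during which one player serves
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.222222222222

game.n.06
(games) the score at a particular point or the score needed to win
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.285714285714

game.n.07
the flesh of wild animals that is used for food
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.5

plot.n.01
a secret scheme to do something (especially something underhand or illegal)
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.2

game.n.09
the game equipment needed in order to play a particular game
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.666666666667

game.n.10
your occupation or line of work
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.25

game.n.11
frivolous or trifling behavior
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.222222222222

bet_on.v.01
place a bet on
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
-1

disabled in the feet or legs
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
-1

game.s.02
willing to face danger
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
-1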

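But the similarity between game.n.04 and leonardo.n.01 is really odd. I don't think the similarity (0.923076923077) should be that high.

game.n.04

animal hunted for food or sport

leonardo.n.01

Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)

0.923076923077

Is there something wrong with my understanding?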

Answer 1

Aya*_*Aya 8

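According to the documentation, the wup_similarity() method returns...

...a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

...and...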

>>> from nltk.corpus import wordnet as wn
>>> game = wn.synset('game.n.04')
>>> leonardo = wn.synset('leonardo.n.01')
>>> game.lowest_common_hypernyms(leonardo)
[Synset('organism.n.01')]
>>> organism = game.lowest_common_hypernyms(leonardo)[0]
>>> game.shortest_path_distance(organism)
2
>>> leonardo.shortest_path_distance(organism)
3


...which is why it considers them similar, although I get...

>>> game.wup_similarity(leonardo)
0.7058823529411765


...which is different for some reason.
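For reference, 0.7058823529411765 is what the usual Wu-Palmer formula, 2 * depth(LCS) / (depth(s1) + depth(s2)), gives for the distances in the session above. The following is a minimal sketch in plain Python with no NLTK calls. The helper name `wu_palmer` is made up for illustration, and the depth of 6 for organism.n.01 (counting the root as depth 1) is an assumption about the WordNet 3.0 hierarchy, not something shown in the session.

```python
def wu_palmer(lcs_depth, dist1, dist2):
    """Wu-Palmer score from the LCS depth and each sense's distance to the LCS.

    lcs_depth counts nodes from the root, so the root itself has depth 1.
    """
    depth1 = lcs_depth + dist1  # depth of the first sense, measured through the LCS
    depth2 = lcs_depth + dist2  # depth of the second sense, measured through the LCS
    return 2.0 * lcs_depth / (depth1 + depth2)


# Assumption: organism.n.01 has depth 6 when the root counts as depth 1.
# The distances 2 (game.n.04) and 3 (leonardo.n.01) come from the session above.
score = wu_palmer(lcs_depth=6, dist1=2, dist2=3)
print(score)  # 12/17, roughly 0.7059
```

Because the shared ancestor is deep relative to the few edges separating each sense from it, the ratio stays high even though the two senses have little to do with each other.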

Update

I want some measurement which shows that dissimilarity('game', 'chess') is much less than dissimilarity('game', 'leonardo')

How about something like this...

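from itertools import product

from nltk.corpus import wordnet as wn


def compare(word1, word2):
    """Return the highest path similarity over all synset pairs of the two words."""
    ss1 = wn.synsets(word1)
    ss2 = wn.synsets(word2)
    # In NLTK 3, path_similarity() may return None when no path connects
    # the two synsets; skip those pairs so max() only sees numbers.
    scores = (s1.path_similarity(s2) for s1, s2 in product(ss1, ss2))
    return max(score for score in scores if score is not None)


for word1, word2 in (('game', 'leonardo'), ('game', 'chess')):
    print("Path similarity of %-10s and %-10s is %.2f" % (word1,
                                                          word2,
                                                          compare(word1, word2)))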


...which prints...

Path similarity of game       and leonardo   is 0.17
Path similarity of game       and chess      is 0.25
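
These two numbers can be read back as path lengths, since path similarity is defined as 1 / (shortest path length + 1). A small sketch, using a hypothetical helper that is not an NLTK function:

```python
def path_similarity_from_distance(distance):
    """Path similarity for a shortest path of `distance` edges between two synsets."""
    return 1.0 / (distance + 1)


# Path lengths chosen to reproduce the two printed scores.
for distance in (5, 3):
    print("%d edges -> %.2f" % (distance, path_similarity_from_distance(distance)))
```

So 0.17 corresponds to a shortest path of 5 edges, which is consistent with the 2 + 3 edges through organism.n.01 found earlier, and 0.25 corresponds to a path of 3 edges.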

