Python NLTK WUP similarity score is not unity for the exact same word

Pro*_*ies 7 python nlp similarity nltk

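The simple code below gives a similarity score of 0.75 in both cases. As you can see, the two words are exactly the same. To avoid any confusion, I also compared a synset with itself. The score refuses to move from 0.75. What is happening here?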

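from nltk.corpus import wordnet as wn

# Both names refer to the same synset: the first sense of 'orange'.
actual = wn.synsets('orange')[0]
predicted = wn.synsets('orange')[0]
similarity = actual.wup_similarity(predicted)
print(similarity)
# Compare the synset with itself.
similarity = actual.wup_similarity(actual)
print(similarity)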

alv*_*vas 8

This is an interesting problem.

TL;DR:

Sorry, there is no short answer to this problem =(


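Too long, want to read:

Looking at the code for wup_similarity(), the problem comes not from the similarity calculation itself but from the way NLTK traverses the WordNet hierarchy to get the lowest_common_hypernyms() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).

Normally, the lowest common hypernym between a synset and itself should be the synset itself: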

>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]

But in the case of orange, it also returns fruit:

>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]

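We have to look at the code and the docstring of lowest_common_hypernyms() at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805:

Get a list of lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned, or if there are multiple such synsets at the same depth they are all returned. However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.

So let's try lowest_common_hypernyms() with use_min_depth=False: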

>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]

That seems to resolve the ambiguity of the tied paths. But the wup_similarity() API does not have a use_min_depth parameter:

>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'

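Note the difference: when use_min_depth==False, lowest_common_hypernyms() checks the maximum depth of each synset as it goes through the candidates, but when use_min_depth==True it checks the minimum depth; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602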
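To make the minimum/maximum depth distinction concrete, here is a minimal sketch on a hand-written hypernym graph. The edges are my own approximation, chosen to be consistent with the synsets and depths reported in this answer; they are not read from WordNet, and the helper names min_depth and max_depth only mirror the NLTK method names. The point is that a synset reachable from the root by two paths of different length has a minimum depth that differs from its maximum depth.

```python
# Hand-written hypernym edges (child -> parents). This is an assumption made
# for illustration; it is not loaded from WordNet.
HYPERNYMS = {
    "entity": [],
    "physical_entity": ["entity"],
    "matter": ["physical_entity"],
    "solid": ["matter"],
    "food": ["solid"],
    "produce": ["food"],
    "object": ["physical_entity"],
    "whole": ["object"],
    "natural_object": ["whole"],
    "plant_part": ["natural_object"],
    "plant_organ": ["plant_part"],
    "reproductive_structure": ["plant_organ"],
    "fruit": ["reproductive_structure"],
    "edible_fruit": ["produce", "fruit"],  # two parents: two paths to the root
    "citrus": ["edible_fruit"],
    "orange": ["citrus"],
}


def min_depth(node):
    """Length of the shortest hypernym path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + min(min_depth(p) for p in parents)


def max_depth(node):
    """Length of the longest hypernym path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)


for name in ("fruit", "orange"):
    print(name, min_depth(name), max_depth(name))
```

In this toy graph, orange reaches the root by a short path through food and by a longer one through fruit, so its minimum depth ties with that of fruit while its maximum depth does not.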

So if we trace the lowest_common_hypernyms() code:

>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]

# if use_min_depth==True
>>> max_depth = max(x.min_depth() for x in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]
>>> 
# if use_min_depth==False
>>> max_depth = max(x.max_depth() for x in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]

This odd behavior of wup_similarity() is actually pointed out in the code comments at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843

# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(other, simulate_root=simulate_root and need_root, use_min_depth=True)

And then the first subsumer in the list is selected at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:

subsumer = subsumers[0]

Naturally, in the case of the orange synset, fruit is selected because it comes first in the list of tied lowest common hypernyms.
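The arithmetic behind the 0.75 can be sketched as follows. This assumes my reading of the NLTK source, namely that the score is 2 * depth / (len1 + len2), where depth is the subsumer's max_depth() plus one and each length is the shortest path from a synset to the subsumer plus depth. It also assumes orange sits three hypernym links below fruit (orange, citrus, edible_fruit, fruit). The function name wup_from_subsumer is hypothetical.

```python
def wup_from_subsumer(subsumer_max_depth, dist1, dist2):
    """Wu-Palmer score given the chosen subsumer's maximum depth and the
    shortest path distance from each synset to that subsumer.

    Assumed formula (my reading of NLTK, not an official API):
        depth = max_depth(subsumer) + 1
        score = 2 * depth / ((dist1 + depth) + (dist2 + depth))
    """
    depth = subsumer_max_depth + 1
    return (2.0 * depth) / ((dist1 + depth) + (dist2 + depth))


# Subsumer = fruit: max depth 8, and orange is assumed to be 3 links below it.
print(wup_from_subsumer(8, 3, 3))

# Subsumer = orange itself: max depth 11, distance 0 on both sides.
print(wup_from_subsumer(11, 0, 0))
```

Under these assumptions the first call gives 18 / 24 = 0.75 and the second gives 24 / 24 = 1.0, which matches the score in the question and the score one would expect for identical synsets.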

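To conclude, the default parameter is a feature rather than a bug: it is there to keep results reproducible with NLTK v2.x.

So one solution might be to manually change the NLTK source to force use_min_depth=False: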

https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845
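If you would rather not edit the installed library, a rough alternative is to recompute the score outside NLTK using the public Synset methods lowest_common_hypernyms(), max_depth() and shortest_path_distance(). The function below is only a sketch of that idea: its name is hypothetical, it ignores the simulate_root handling that the real wup_similarity() performs (so it is only meant for synsets that share a root, such as nouns), and the formula is my reading of the NLTK source rather than a documented contract.

```python
def wup_similarity_max_depth(synset1, synset2):
    """Wu-Palmer similarity with the subsumer chosen by maximum depth.

    Works on any objects exposing the NLTK Synset methods used below.
    Hypothetical helper: no simulate_root handling is included.
    """
    subsumers = synset1.lowest_common_hypernyms(synset2, use_min_depth=False)
    if not subsumers:
        return None  # no common hypernym found

    subsumer = subsumers[0]
    # Count nodes rather than edges on the path from the subsumer to the root.
    depth = subsumer.max_depth() + 1

    len1 = synset1.shortest_path_distance(subsumer)
    len2 = synset2.shortest_path_distance(subsumer)
    if len1 is None or len2 is None:
        return None

    return (2.0 * depth) / ((len1 + depth) + (len2 + depth))
```

With NLTK and the WordNet corpus installed, it would be called as wup_similarity_max_depth(x, x) with x = wn.synsets('orange')[0].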


EDITED

To work around the problem, you could do an ad hoc check for identical synsets:

def wup_similarity_hacked(synset1, synset2):
  if synset1 == synset2:
    return 1.0
  else:
    return synset1.wup_similarity(synset2)