etz*_*rid 15 python nlp nltk wordnet
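I am using NLTK's WordNet API. When I compare one synset with another, I get None, but when I compare them the other way around, I get a float value.
Shouldn't they give the same value? Is there an explanation, or is this a bug in WordNet?
Example: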
from nltk.corpus import wordnet as wn
wn.synset('car.n.01').path_similarity(wn.synset('automobile.v.01')) # None
wn.synset('automobile.v.01').path_similarity(wn.synset('car.n.01')) # 0.06666666666666667
alv*_*vas 16
Technically, without the dummy root, the car and automobile synsets have no link to each other:
>>> from nltk.corpus import wordnet as wn
>>> x = wn.synset('car.n.01')
>>> y = wn.synset('automobile.v.01')
>>> print(x.shortest_path_distance(y))
None
>>> print(y.shortest_path_distance(x))
None
Now, let's look closely at the dummy root issue. First, NLTK has a neat function that tells you whether a synset needs a dummy root:
>>> x._needs_root()
False
>>> y._needs_root()
True
Next, when you look at the path_similarity code (http://nltk.googlecode.com/svn-/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#Synset.path_similarity), you can see:
def path_similarity(self, other, verbose=False, simulate_root=True):
    distance = self.shortest_path_distance(other, \
        simulate_root=simulate_root and self._needs_root())
    if distance is None or distance < 0:
        return None
    return 1.0 / (distance + 1)
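As a quick check of the last line of that function, here is a minimal sketch of the distance-to-similarity formula on its own. The helper name similarity_from_distance is made up for illustration and is not part of NLTK; it only mirrors the arithmetic shown above.

```python
def similarity_from_distance(distance):
    """Hypothetical helper mirroring the final lines of path_similarity."""
    if distance is None or distance < 0:
        return None
    return 1.0 / (distance + 1)


# A similarity of 0.0666... corresponds to a path distance of 14,
# because 1.0 / (14 + 1) == 1.0 / 15.
print(similarity_from_distance(14))
# No connecting path means no similarity value at all.
print(similarity_from_distance(None))
```

So the float in the question corresponds to a path of length 14, while None simply means that no path was found.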
So for the automobile synset, the argument simulate_root=simulate_root and self._needs_root() will always be True when you try y.path_similarity(x), and when you try x.path_similarity(y) it will always be False, because x._needs_root() is False:
>>> True and y._needs_root()
True
>>> True and x._needs_root()
False
Now, when path_similarity() passes down to shortest_path_distance() (https://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#Synset.shortest_path_distance) and then to hypernym_distances(), it tries to collect a list of hypernyms to check their distances. Without simulate_root = True, the automobile synset will not connect to the car synset, and vice versa:
>>> y.hypernym_distances(simulate_root=True)
set([(Synset('automobile.v.01'), 0), (Synset('*ROOT*'), 2), (Synset('travel.v.01'), 1)])
>>> y.hypernym_distances()
set([(Synset('automobile.v.01'), 0), (Synset('travel.v.01'), 1)])
>>> x.hypernym_distances()
set([(Synset('object.n.01'), 8), (Synset('self-propelled_vehicle.n.01'), 2), (Synset('whole.n.02'), 8), (Synset('artifact.n.01'), 7), (Synset('physical_entity.n.01'), 10), (Synset('entity.n.01'), 11), (Synset('object.n.01'), 9), (Synset('instrumentality.n.03'), 5), (Synset('motor_vehicle.n.01'), 1), (Synset('vehicle.n.01'), 4), (Synset('entity.n.01'), 10), (Synset('physical_entity.n.01'), 9), (Synset('whole.n.02'), 7), (Synset('conveyance.n.03'), 5), (Synset('wheeled_vehicle.n.01'), 3), (Synset('artifact.n.01'), 6), (Synset('car.n.01'), 0), (Synset('container.n.01'), 4), (Synset('instrumentality.n.03'), 6)])
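The asymmetry can be reproduced without WordNet at all. The following is a toy sketch, not NLTK's implementation: the five-node graph, the function names, and the rule that only verbs need a root are all assumptions made for illustration. It only copies the one detail that matters, namely that the simulate_root flag depends on the synset the method is called on.

```python
from collections import deque

# Toy hypernym graph (hypothetical, NOT real WordNet data).
HYPERNYMS = {
    "car.n": ["vehicle.n"],
    "vehicle.n": ["entity.n"],
    "entity.n": [],
    "automobile.v": ["travel.v"],
    "travel.v": [],
}


def needs_root(name):
    # Assumption for this toy: only the verb hierarchy lacks a single root.
    return name.endswith(".v")


def hypernym_distances(name, simulate_root=False):
    """Shortest distance from `name` to itself and each of its ancestors."""
    distances = {}
    queue = deque([(name, 0)])
    while queue:
        node, depth = queue.popleft()
        if node in distances:
            continue
        distances[node] = depth
        queue.extend((parent, depth + 1) for parent in HYPERNYMS[node])
    if simulate_root:
        # The dummy root sits one step above the farthest ancestor.
        distances["*ROOT*"] = max(distances.values()) + 1
    return distances


def shortest_path_distance(a, b, simulate_root=False):
    dist_a = hypernym_distances(a, simulate_root)
    dist_b = hypernym_distances(b, simulate_root)
    shared = set(dist_a) & set(dist_b)
    if not shared:
        return None
    return min(dist_a[n] + dist_b[n] for n in shared)


def path_similarity(a, b, simulate_root=True):
    # The flag depends on `a` only, which is what breaks symmetry.
    distance = shortest_path_distance(a, b, simulate_root and needs_root(a))
    if distance is None or distance < 0:
        return None
    return 1.0 / (distance + 1)


print(path_similarity("car.n", "automobile.v"))  # None
print(path_similarity("automobile.v", "car.n"))  # 1.0 / 6
```

In this toy graph the verb side reaches the dummy root in 2 steps and the noun side in 3, so the second call sees a distance of 5, while the first call never adds the dummy root and finds no shared ancestor.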
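So in theory, the correct path_similarity is 0/None, but because of the simulate_root=simulate_root and self._needs_root() argument, nltk.corpus.wordnet.path_similarity() in NLTK's API is not commutative.
However, the code is not wrong or buggy either: any comparison of synset distances that goes through the root will be constant, because the position of the dummy *ROOT* never changes. So the best practice for computing path_similarity is: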
>>> from nltk.corpus import wordnet as wn
>>> x = wn.synset('car.n.01')
>>> y = wn.synset('automobile.v.01')
# When you always want a numeric (non-None) value, since going to
# the *ROOT* will always get you some sort of distance
# from synset x to synset y.
# (Comparing None with a float like this relies on Python 2 ordering.)
>>> max(wn.path_similarity(x,y), wn.path_similarity(y,x))
# When you can allow None in the synset similarity comparison
>>> min(wn.path_similarity(x,y), wn.path_similarity(y,x))
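The max/min trick above leans on Python 2, where None sorts below every number. In Python 3, comparing None with a float raises TypeError, so a small wrapper is needed. The sketch below is one possible way to do it; combine_similarities and its prefer parameter are hypothetical names, not part of NLTK, and the function takes the two already-computed scores so it works with any similarity measure.

```python
def combine_similarities(forward, backward, prefer="max"):
    """Merge sim(x, y) and sim(y, x) without comparing None to a float.

    prefer="max": return the larger numeric score, or None if both are None.
    prefer="min": return None if either score is None, else the smaller one.
    """
    if prefer not in ("max", "min"):
        raise ValueError("prefer must be 'max' or 'min'")
    scores = [s for s in (forward, backward) if s is not None]
    if prefer == "max":
        return max(scores) if scores else None
    if len(scores) < 2:
        return None
    return min(scores)


# Values taken from the question: one direction is None, the other 1/15.
print(combine_similarities(None, 1.0 / 15))                # 0.06666666666666667
print(combine_similarities(None, 1.0 / 15, prefer="min"))  # None
```

With real synsets this would be called as combine_similarities(x.path_similarity(y), y.path_similarity(x)).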
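I don't think this is a bug in WordNet per se. In your case, automobile is specified as a verb and car as a noun, so you will need to look through the synsets to see what the graph looks like and decide whether the nets are labeled correctly.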
from nltk.corpus import wordnet as wn
A = 'car.n.01'
B = 'automobile.v.01'
C = 'automobile.n.01'
wn.synset(A).path_similarity(wn.synset(B))
wn.synset(B).path_similarity(wn.synset(A))
wn.synset(A).path_similarity(wn.synset(C)) # is 1
wn.synset(C).path_similarity(wn.synset(A)) # is also 1
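Since the mismatch comes from comparing a noun with a verb, it can help to check the part of speech before asking for a similarity. Here is a small sketch that reads the POS tag straight from the synset name string. It assumes names follow the lemma.pos.sense pattern seen above (for example 'car.n.01'); pos_of and same_pos are made-up helper names, and with real Synset objects the POS can be read from the synset itself instead.

```python
def pos_of(synset_name):
    """Return the POS tag from a name like 'car.n.01' (assumed format)."""
    lemma, pos, sense = synset_name.rsplit(".", 2)
    return pos


def same_pos(name_a, name_b):
    """True when both synset names carry the same part-of-speech tag."""
    return pos_of(name_a) == pos_of(name_b)


print(same_pos("car.n.01", "automobile.v.01"))  # False
print(same_pos("car.n.01", "automobile.n.01"))  # True
```

A False result here is a hint that any similarity value will have to go through the dummy root, if one is simulated at all.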