How do I get the domain of a word using WordNet in Python?

Mad*_*dan 9 python nltk wordnet

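How can I find the domain of a word using the nltk Python module and WordNet?

Suppose I have words like (transaction, demand draft, cheque, passbook), and the domain for all of these words is "BANK". How can we get this using nltk and WordNet in Python?

I am trying to do it through the hypernym and hyponym relations.

For example: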

from nltk.corpus import wordnet as wn
sports = wn.synset('sport.n.01')
sports.hyponyms()
[Synset('judo.n.01'), Synset('athletic_game.n.01'), Synset('spectator_sport.n.01'),    Synset('contact_sport.n.01'), Synset('cycling.n.01'), Synset('funambulism.n.01'), Synset('water_sport.n.01'), Synset('riding.n.01'), Synset('gymnastics.n.01'), Synset('sledding.n.01'), Synset('skating.n.01'), Synset('skiing.n.01'), Synset('outdoor_sport.n.01'), Synset('rowing.n.01'), Synset('track_and_field.n.01'), Synset('archery.n.01'), Synset('team_sport.n.01'), Synset('rock_climbing.n.01'), Synset('racing.n.01'), Synset('blood_sport.n.01')]

bark = wn.synset('bark.n.02')
bark.hypernyms()
[Synset('noise.n.01')]

alv*_*vas 12

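There is no explicit domain information in Princeton WordNet, nor in NLTK's WordNet API.

I would recommend getting a copy of the WordNet Domains resource and then linking your synsets to its domains; see http://wndomains.fbk.eu/

After you have registered and completed the download, you will see a wn-domains-3.2-20070223 text file. It is a tab-delimited file: the first column is the offset-PartOfSpeech identifier and the second column contains the domain labels separated by spaces, e.g.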

00584282-v  military pedagogy
00584395-v  military school university
00584526-v  animals pedagogy
00584634-v  pedagogy
00584743-v  school university
00585097-v  school university
00585271-v  pedagogy
00585495-v  pedagogy
00585683-v  psychological_features
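A single line in that format can be split into its identifier and its list of domain labels. This is only a minimal sketch: `parse_domain_line` is a hypothetical helper name, and splitting on any run of whitespace (rather than strictly on a tab) is my own choice so that copy-pasted samples with spaces still parse.

```python
def parse_domain_line(line):
    """Split one WordNet Domains line into (synset_id, [domain, ...])."""
    # First field is the offset-pos identifier; the rest are domain labels.
    # split(None, 1) splits on the first run of whitespace (tab or spaces).
    ssid, domains = line.strip().split(None, 1)
    return ssid, domains.split()


print(parse_domain_line("00584395-v\tmilitary school university"))
```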

Then you can use the following script to access the domains of synsets:

from collections import defaultdict
from nltk.corpus import wordnet as wn

# Load the WordNet Domains file into two lookup tables.
domain2synsets = defaultdict(list)  # domain label -> synset IDs
synset2domains = defaultdict(list)  # synset ID (offset-pos) -> domain labels
with open('wn-domains-3.2-20070223', 'r') as fin:
    for line in fin:
        ssid, doms = line.strip().split('\t')
        doms = doms.split()
        synset2domains[ssid] = doms
        for d in doms:
            domain2synsets[d].append(ssid)

# Get the domains for each synset.
for ss in wn.all_synsets():
    ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
    if synset2domains[ssid]:  # not all synsets are in WordNet Domains.
        print(ss, ssid, synset2domains[ssid])

# Get the synsets for each domain.
for dom in sorted(domain2synsets):
    print(dom, domain2synsets[dom][:3])
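To get back to the question (one domain for a group of words), one possible approach is to let every word vote for the domains of its synsets and take the most frequent label. This is just a sketch under assumptions, not something WordNet Domains or NLTK provides: `guess_domain` and `synset_ids_for_word` are hypothetical helper names, the voting scheme is my own, and it assumes the offsets in the domains file match the WordNet version that NLTK loads, which is worth verifying.

```python
from collections import Counter


def synset_ids_for_word(word):
    """Return offset-pos IDs for all synsets of `word` (needs NLTK + WordNet data)."""
    from nltk.corpus import wordnet as wn  # lazy import: only needed for real lookups
    # WordNet lemma names use underscores for multi-word entries.
    lemma = word.replace(" ", "_")
    return [str(ss.offset()).zfill(8) + "-" + ss.pos() for ss in wn.synsets(lemma)]


def guess_domain(words, synset2domains, ids_for_word=synset_ids_for_word):
    """Return the domain label shared by the most words, or None if none is found."""
    votes = Counter()
    for word in words:
        # Collect the domains of all senses; each word votes once per domain.
        word_domains = set()
        for ssid in ids_for_word(word):
            word_domains.update(synset2domains.get(ssid, []))
        votes.update(word_domains)
    return votes.most_common(1)[0][0] if votes else None
```

With the `synset2domains` table loaded as in the script, this would be called as `guess_domain(['transaction', 'demand draft', 'cheque', 'passbook'], synset2domains)`. Words with many unrelated senses will add noise, so some word sense disambiguation beforehand may help.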

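Also take a look at wn-affect, which is very useful for disambiguating words for sentiment within the WordNet Domains resource.


The updated NLTK v3.0 comes with the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/), and since the French synsets share the same offset IDs, you can simply use WordNet Domains (WND) as a cross-lingual resource. The French lemma names can be accessed like this: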

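# Get the domains and French lemma names for each synset.
# The language code may differ between NLTK releases (e.g. 'fra'); check wn.langs().
for ss in wn.all_synsets():
    ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
    if synset2domains[ssid]:  # not all synsets are in WordNet Domains.
        print(ss, ss.lemma_names('fre'), ssid, synset2domains[ssid])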

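Note that the most recent versions of NLTK changed synset attributes into "get" functions: Synset.offset -> Synset.offset(), so older NLTK versions need ss.offset without the parentheses.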
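If the same script has to run on both older and newer NLTK versions, the identifier can be built in a way that handles either form. A minimal sketch, assuming `synset_id` as a hypothetical helper name; it only checks whether `offset` and `pos` are callable:

```python
def synset_id(ss):
    """Build the offset-pos ID whether offset/pos are attributes or methods."""
    offset = ss.offset() if callable(ss.offset) else ss.offset
    pos = ss.pos() if callable(ss.pos) else ss.pos
    # Zero-pad the offset to 8 digits to match the WordNet Domains file.
    return str(offset).zfill(8) + "-" + pos
```

Inside the loops, `ssid = synset_id(ss)` would then replace the line that builds `ssid` by hand.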