How can I use the language option in synsets (nltk) when WordNet is loaded manually?

Sté*_*e C 5 python nlp path nltk wordnet

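For a specific purpose, I have to use WordNet 1.6 instead of the current version shipped with the nltk package. I downloaded the old version here and tried to run a simple piece of code with the French option.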

from collections import defaultdict
import nltk
#nltk.download() 
import os
import sys
from nltk.corpus import WordNetCorpusReader

cwd = os.getcwd()
nltk.data.path.append(cwd)
wordnet16_dir="wordnet-1.6/"
wn16_path = "{0}/dict".format(wordnet16_dir)
wn = WordNetCorpusReader(os.path.abspath("{0}/{1}".format(cwd, wn16_path)), nltk.data.find(wn16_path))

senses=wn.synsets('gouvernement',lang=u'fre')

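It seems that the WordNet I downloaded manually cannot be linked to the files that the nltk module uses for foreign languages. The error I get is the following: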

Traceback (most recent call last):
File "C:/Users/Stephanie/Test/temp.py", line 16, in <module>
senses=wn.synsets('gouvernement',lang=u'fre')
File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1419, in synsets
self._load_lang_data(lang)
File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1064, in _load_lang_data
if lang not in self.langs():
File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1088, in langs
fileids = self._omw_reader.fileids()
AttributeError: 'FileSystemPathPointer' object has no attribute 'fileids'

Using an English word doesn't generate any error (so the dictionary itself is loaded correctly):

senses=wn.synsets('government')
print senses

[Synset('government.n.01'), Synset('government.n.02'), Synset('government.n.03'), Synset('politics.n.02')]

If I use the current version of WordNet that ships with the nltk module, I have no problem using French (so it's not a syntax problem with the optional argument):

from nltk.corpus import wordnet as wn
senses=wn.synsets('gouvernement',lang=u'fre')
print senses
[Synset('government.n.02'), Synset('opinion.n.05'), Synset('government.n.03'), Synset('rule.n.01'), Synset('politics.n.02'), Synset('government.n.01'), Synset('regulation.n.03'), Synset('reign.n.03')]

But, as I said, I really have to use the old version. I guess this might be a path problem. I've been trying to read the code of the WordNetCorpusReader class, but I am quite new to Python and so far I don't really see what the problem is, except that it doesn't find a particular file.

The file it needs seems to be wn-data-fre.tab, which is located in \nltk_data\corpora\omw\fre. I am pretty sure that I have to replace this file with a version compatible with WordNet 1.6, but still, why can't WordNetCorpusReader find it?

alv*_*vas 5

Short answer

There is no WordNet 1.6 with language parameters. You cannot use lang='fre' when loading a different WordNet through NLTK.


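Long answer

The lang=... parameter was added with the Open Multilingual WordNet (OMW: http://compling.hss.ntu.edu.sg/omw/), which links wordnets in different languages to Princeton WordNet version 3.0. See https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1050

The lang=... parameter calls this function: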

def langs(self):
    ''' return a list of languages supported by Multilingual Wordnet '''
    import os
    langs = []
    fileids = self._omw_reader.fileids()
    for fileid in fileids:
        file_name, file_extension = os.path.splitext(fileid)
        if file_extension == '.tab':
            langs.append(file_name.split('-')[-1])

    return langs
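To see what that loop produces without needing NLTK or its data, here is a minimal standalone sketch of the same logic. The name langs_from_fileids and the file IDs in the list are made up for illustration; they are not part of NLTK.

```python
import os


def langs_from_fileids(fileids):
    """Collect language codes from file IDs shaped like 'fre/wn-data-fre.tab'."""
    langs = []
    for fileid in fileids:
        file_name, file_extension = os.path.splitext(fileid)
        if file_extension == ".tab":
            # 'fre/wn-data-fre' -> 'fre' (the part after the last dash)
            langs.append(file_name.split("-")[-1])
    return langs


# Hypothetical file IDs, invented for this example.
example_fileids = ["fre/wn-data-fre.tab", "fre/LICENSE", "jpn/wn-data-jpn.tab"]
print(langs_from_fileids(example_fileids))  # ['fre', 'jpn']
```

The point is that the language list comes entirely from whatever self._omw_reader.fileids() returns, so the object passed as omw_reader has to be a corpus reader that can list files.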

It then looks up the language file; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1070

 f = self._omw_reader.open('{0:}/wn-data-{0:}.tab'.format(lang))

So if lang == 'fre', then self._omw_reader is asked to open fre/wn-data-fre.tab.

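The main reason OMW cannot find wn-data-fre.tab in nltk_data/corpora/omw/ is that you set omw_reader to the wn16_path location when initializing the WordNetCorpusReader object; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1006. The second argument in your call, nltk.data.find(wn16_path), is a FileSystemPathPointer rather than a corpus reader, which is why the traceback complains about the missing fileids attribute.

So when the French data is loaded, self._omw_reader.open('{0:}/wn-data-{0:}.tab'.format(lang)) cannot find the file. (See https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1419 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1070)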
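If you want to check for yourself which directory actually contains the language file, a small stdlib-only sketch like the one below may help. The helper names omw_tab_relpath and has_lang_file are hypothetical (they are not NLTK functions); only the format string is taken from the line quoted above.

```python
import os


def omw_tab_relpath(lang):
    """Relative path produced by the format string quoted above."""
    return "{0:}/wn-data-{0:}.tab".format(lang)


def has_lang_file(root, lang):
    """Return True if the language file exists under the directory `root`."""
    parts = omw_tab_relpath(lang).split("/")
    return os.path.isfile(os.path.join(root, *parts))


print(omw_tab_relpath("fre"))  # fre/wn-data-fre.tab
```

Calling has_lang_file on your nltk_data/corpora/omw directory and on your wordnet-1.6/dict directory should show that only the former holds fre/wn-data-fre.tab, assuming the OMW data is installed in the usual place.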


What you could try is loading two WordNet instances:

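import os
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import WordNetCorpusReader

cwd = os.getcwd()
nltk.data.path.append(cwd)
wordnet16_dir="wordnet-1.6/"

wn16_path = "{0}/dict".format(wordnet16_dir)
# WordNet 1.6, loaded manually as in the question (used here for English data only).
wn16 = WordNetCorpusReader(os.path.abspath("{0}/{1}".format(cwd, wn16_path)), nltk.data.find(wn16_path))

def synset2offset(ss):
    # Build an ID string: zero-padded 8-digit offset, a dash, then the POS tag.
    return str(ss.offset()).zfill(8) + '-' + ss.pos()


# Sets make the membership test below fast.
wn16_ids = set(synset2offset(ss) for ss in wn16.all_synsets())
wn30_ids = set(synset2offset(ss) for ss in wn.all_synsets())


# The French lookup goes through the WordNet bundled with NLTK (with OMW data).
senses30 = wn.synsets('gouvernement',lang=u'fre')
# Keep only the synsets whose offset-POS ID also occurs in WordNet 1.6.
# Note: this matches purely on the ID string; it does not check that both
# versions mean the same synset by that ID.
senses16 = [ss for ss in senses30 if synset2offset(ss) in wn16_ids]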
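To make the ID format and the filtering step concrete without loading any corpus, here is a rough sketch that uses a stand-in object instead of a real Synset. FakeSynset, keep_known and the offsets are invented for illustration; only synset2offset mirrors the helper above.

```python
class FakeSynset:
    """Hypothetical stand-in that mimics only Synset.offset() and Synset.pos()."""

    def __init__(self, offset, pos):
        self._offset = offset
        self._pos = pos

    def offset(self):
        return self._offset

    def pos(self):
        return self._pos


def synset2offset(ss):
    # Zero-padded 8-digit offset, a dash, then the POS tag.
    return str(ss.offset()).zfill(8) + "-" + ss.pos()


def keep_known(synsets, known_ids):
    """Keep the synsets whose ID string appears in known_ids."""
    known = set(known_ids)
    return [ss for ss in synsets if synset2offset(ss) in known]


# Made-up offsets, for illustration only.
candidates = [FakeSynset(1234, "n"), FakeSynset(987654, "v")]
kept = keep_known(candidates, ["00001234-n"])
print([synset2offset(ss) for ss in kept])  # ['00001234-n']
```

With real data, candidates would be the result of wn.synsets('gouvernement', lang='fre') and known_ids would be built from wn16.all_synsets().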