Sté*_*e C 5 python nlp path nltk wordnet
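For a specific purpose, I have to use WordNet 1.6 instead of the current version shipped with the nltk package. I downloaded the old version and tried to run a simple code snippet using the French option.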
from collections import defaultdict
import nltk
#nltk.download()
import os
import sys
from nltk.corpus import WordNetCorpusReader
cwd = os.getcwd()
nltk.data.path.append(cwd)
wordnet16_dir="wordnet-1.6/"
wn16_path = "{0}/dict".format(wordnet16_dir)
wn = WordNetCorpusReader(os.path.abspath("{0}/{1}".format(cwd, wn16_path)), nltk.data.find(wn16_path))
senses=wn.synsets('gouvernement',lang=u'fre')
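It seems that the WordNet I downloaded manually cannot be linked to the files used by the nltk module that handles foreign languages. The error I get is the following: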
Traceback (most recent call last):
  File "C:/Users/Stephanie/Test/temp.py", line 16, in <module>
    senses=wn.synsets('gouvernement',lang=u'fre')
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1419, in synsets
    self._load_lang_data(lang)
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1064, in _load_lang_data
    if lang not in self.langs():
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1088, in langs
    fileids = self._omw_reader.fileids()
AttributeError: 'FileSystemPathPointer' object has no attribute 'fileids'
Using an English word doesn't generate any error, so the dictionary itself was loaded correctly:
senses=wn.synsets('government')
print senses
[Synset('government.n.01'), Synset('government.n.02'), Synset('government.n.03'), Synset('politics.n.02')]
If I use the current version of WordNet that ships with the nltk module, French works without any problem, so the syntax of the optional argument is not the issue:
from nltk.corpus import wordnet as wn
senses=wn.synsets('gouvernement',lang=u'fre')
print senses
[Synset('government.n.02'), Synset('opinion.n.05'), Synset('government.n.03'), Synset('rule.n.01'), Synset('politics.n.02'), Synset('government.n.01'), Synset('regulation.n.03'), Synset('reign.n.03')]
But, as I said, I really have to use the old version. I guess this might be a path problem. I have been trying to read the code of WordNetCorpusReader, but I am quite new to Python and so far I don't see what the problem is, except that it fails to find a particular file.
The needed file seems to be wn-data-fre.tab, which is located in \nltk_data\corpora\omw\fre. I am pretty sure I will have to replace that file with a version compatible with WordNet 1.6, but still, why can't WordNetCorpusReader find it?
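Short answer:
There is no WordNet 1.6 with a language parameter. lang='fre' cannot be used when a different WordNet is loaded through NLTK.
Long answer:
The lang=... parameter was added together with the Open Multilingual WordNet (OMW: http://compling.hss.ntu.edu.sg/omw/), which links wordnets in different languages to Princeton WordNet version 3.0. See https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1050.
The lang=... parameter calls this function: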
def langs(self):
    ''' return a list of languages supported by Multilingual Wordnet '''
    import os
    langs = []
    fileids = self._omw_reader.fileids()
    for fileid in fileids:
        file_name, file_extension = os.path.splitext(fileid)
        if file_extension == '.tab':
            langs.append(file_name.split('-')[-1])
    return langs
And to find the file, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1070:
f = self._omw_reader.open('{0:}/wn-data-{0:}.tab'.format(lang))
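So if lang == 'fre', the file that self._omw_reader is asked to open is fre/wn-data-fre.tab.
The main reason wn-data-fre.tab in nltk_data/corpora/omw/ cannot be found is that you set omw_reader to a pointer to wn16_path when you initialized the WordNetCorpusReader object; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1006. That argument is expected to be a corpus reader over the OMW files, but nltk.data.find(wn16_path) returns a FileSystemPathPointer, which has no fileids() method, hence the AttributeError in the traceback.
So when the French data is loaded, the call self._omw_reader.open('{0:}/wn-data-{0:}.tab'.format(lang)) cannot succeed. (See https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1419 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1070.)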
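To make the failure concrete, here is a minimal sketch that does not use NLTK at all. `FakeOMWReader` and `FakePathPointer` are hypothetical stand-ins (not NLTK classes) for a corpus reader over the OMW files and for the path pointer that ends up in `_omw_reader` in the question; `langs` copies the logic of the method quoted above, and `omw_fileid` is an illustrative helper around the format string NLTK uses.

```python
import os


class FakeOMWReader(object):
    """Hypothetical stand-in for a corpus reader over nltk_data/corpora/omw."""

    def __init__(self, fileids):
        self._fileids = list(fileids)

    def fileids(self):
        return self._fileids


class FakePathPointer(object):
    """Hypothetical stand-in for a path pointer: it has a path but no fileids()."""

    def __init__(self, path):
        self.path = path


def langs(omw_reader):
    """Same logic as the langs() method quoted above, with the reader passed in."""
    found = []
    for fileid in omw_reader.fileids():
        file_name, file_extension = os.path.splitext(fileid)
        if file_extension == '.tab':
            found.append(file_name.split('-')[-1])
    return found


def omw_fileid(lang):
    """File id requested from the reader for a given language code."""
    return '{0:}/wn-data-{0:}.tab'.format(lang)


reader = FakeOMWReader(['fre/wn-data-fre.tab', 'fre/LICENSE', 'jpn/wn-data-jpn.tab'])
print(langs(reader))       # ['fre', 'jpn']
print(omw_fileid('fre'))   # fre/wn-data-fre.tab

# Passing something that is not a reader fails the same way as in the question.
try:
    langs(FakePathPointer('wordnet-1.6/dict'))
except AttributeError as err:
    print('AttributeError: {0}'.format(err))
```

The last call fails before any file lookup happens, which matches the traceback: the problem is the type of the object stored as the OMW reader, not the location of wn-data-fre.tab.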
What you can try is to load two WordNet instances:
import os
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import WordNetCorpusReader

# Make the current directory searchable so nltk.data.find can locate wordnet-1.6/dict.
cwd = os.getcwd()
nltk.data.path.append(cwd)
wordnet16_dir = "wordnet-1.6/"
wn16_path = "{0}/dict".format(wordnet16_dir)
wn16 = WordNetCorpusReader(os.path.abspath("{0}/{1}".format(cwd, wn16_path)), nltk.data.find(wn16_path))

def synset2offset(ss):
    # Zero-padded 8-digit offset plus part of speech.
    return str(ss.offset()).zfill(8) + '-' + ss.pos()

# Sets make the membership test below fast.
wn16_ids = set(synset2offset(ss) for ss in wn16.all_synsets())
wn30_ids = set(synset2offset(ss) for ss in wn.all_synsets())

# French lookup goes through WordNet 3.0; keep only the IDs that also occur in 1.6.
senses30 = wn.synsets('gouvernement', lang=u'fre')
senses16 = [ss for ss in senses30 if synset2offset(ss) in wn16_ids]
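The ID format and the filtering step can be tried out without any WordNet data. The following is a minimal sketch under stated assumptions: `format_synset_id` is a hypothetical helper that mirrors synset2offset but takes a plain offset and POS tag, and the (offset, pos) pairs are made-up values, not real WordNet entries. One caveat worth keeping in mind: as far as I know, an offset is a byte position inside each version's data files, so an ID that occurs in both versions is not guaranteed to denote the same synset; treat the intersection as a rough filter and check the results.

```python
def format_synset_id(offset, pos):
    """Build an ID from an integer offset and a POS tag, zero-padded to 8 digits."""
    return str(offset).zfill(8) + '-' + pos


# Made-up (offset, pos) pairs standing in for two WordNet versions.
old_version = [(1740, 'n'), (8050678, 'n'), (2604760, 'v')]
new_version = [(8050678, 'n'), (5901508, 'n'), (2604760, 'v')]

# Build the lookup set once, then filter the newer entries against it.
old_ids = set(format_synset_id(offset, pos) for offset, pos in old_version)
shared = [format_synset_id(offset, pos) for offset, pos in new_version
          if format_synset_id(offset, pos) in old_ids]

print(format_synset_id(1740, 'n'))   # 00001740-n
print(shared)                        # ['08050678-n', '02604760-v']
```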