How do I print out the main lemma of a WordNet synset? Python NLTK

mod*_*ish 2 python nltk wordnet

I have a large set of WordNet synsets. A small part of this set is:

syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}

For each synset in the set, I want to print the synset term (the main lemma of the synset). For example, the output for the set above should be:

brutal, benignant

Here is the code I used:

    from nltk.corpus import wordnet as wn
    for s in syns:
        print(wn.s.lemmas[0])

But this does not work, and I get the following error:

AttributeError: 'WordNetCorpusReader' object has no attribute 's'

This is because s is treated as a string, not an object. I tried converting s to bytes, as follows:

    s = bytes(s)

But that does not work either. What is the simplest way to print just the lemmas shown above?

I checked here, which is a good approach, but my set of synsets is in string form rather than actual objects.

Thanks in advance.

alv*_*vas 5

TL;DR

>>> from nltk.corpus import wordnet as wn
>>> syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}
>>> [wn.synset(i[8:-2]) for i in syns]
[Synset('benignant.s.02'), Synset('brutal.s.04')]
>>> syns = [wn.synset(i[8:-2]) for i in syns]
>>> syns[0].lemma_names()
[u'benignant', u'gracious']

First, getting input in which objects were printed out as strings is unusual. The first intuitive approach would be to use something like ast.literal_eval() or eval() with the Synset type, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L305 (but before that, see http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html):

>>> from nltk.corpus.reader.wordnet import Synset
>>> from nltk.corpus import wordnet as wn
>>> syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}
>>> [eval(i) for i in syns]
[Synset('None'), Synset('None')]

Apparently, the Synset class does not work independently of nltk.corpus.wordnet. So we look at the wordnet.synset() function instead (https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1217). It seems to take only the pre-assigned name of a Synset object, so:

>>> wn.synset('brutal.s.04')
Synset('brutal.s.04')
>>> type(wn.synset('brutal.s.04'))
<class 'nltk.corpus.reader.wordnet.Synset'>

Once each pseudo-synset string in your input syns has become a real Synset, you can work with it as shown in "How do I print out just the word itself in a WordNet synset using Python NLTK?"

Back to your unusual input syns: slicing off the leading Synset(' (8 characters) and the trailing ') (2 characters) gives the name of the synset:

>>> syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}
>>> list(syns)[0]
"Synset('benignant.s.02')"
>>> list(syns)[0][8:-2]
'benignant.s.02'
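The [8:-2] slice silently returns garbage if a string does not have exactly the Synset('...') shape. As a minimal sketch of a stricter alternative, a regular expression can extract the name and reject anything else. The pattern SYNSET_REPR and the helper synset_name are hypothetical names introduced here, and the sketch assumes the strings look like the ones in the question.

```python
import re

# Matches strings shaped like "Synset('brutal.s.04')" (assumed input format).
SYNSET_REPR = re.compile(r"^Synset\('(?P<name>.+)'\)$")


def synset_name(text):
    """Return the name inside "Synset('...')", or raise ValueError."""
    match = SYNSET_REPR.match(text.strip())
    if match is None:
        raise ValueError("not a Synset string: %r" % (text,))
    return match.group("name")


syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}
# Sort so the order does not depend on set iteration.
names = sorted(synset_name(s) for s in syns)
print(names)
```

The resulting names can be passed to wn.synset() in the same way as the sliced strings.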

So back to converting it into a Synset:

>>> syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}
>>> [wn.synset(i[8:-2]) for i in syns]
[Synset('benignant.s.02'), Synset('brutal.s.04')]
>>> syns = [wn.synset(i[8:-2]) for i in syns]
>>> syns[0].lemma_names()
[u'benignant', u'gracious']
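Putting the pieces together for the original question, the following is a rough sketch rather than a definitive implementation. main_lemmas and its lookup parameter are hypothetical names introduced here; lookup only exists so the function can be tried without the WordNet data, and it falls back to wn.synset by default. Treating the first entry of lemma_names() as the "main lemma" is an assumption based on the question, and the [8:-2] slice assumes every string has exactly the Synset('...') shape.

```python
def main_lemmas(synset_strings, lookup=None):
    """Return the first lemma name for each "Synset('...')" string."""
    if lookup is None:
        # Imported lazily; requires NLTK and its WordNet corpus to be installed.
        from nltk.corpus import wordnet as wn
        lookup = wn.synset
    # Sort the strings so the result does not depend on set ordering.
    return [lookup(s[8:-2]).lemma_names()[0] for s in sorted(synset_strings)]


# Example use (needs the WordNet data):
#     syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}
#     print(", ".join(main_lemmas(syns)))
```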

But let's step back: you are getting this odd input syns because someone saved their output by simply calling str() on a Synset object:

>>> syns[0]
Synset('benignant.s.02')
>>> str(syns[0])
"Synset('benignant.s.02')"

The person could have simply done:

>>> syns[0].name()
u'benignant.s.02'

Your input syns object would then look like this:

syns = {u'brutal.s.04', u'benignant.s.02'}

To read it back, you can simply do:

>>> from nltk.corpus import wordnet as wn
>>> syns = {u'brutal.s.04', u'benignant.s.02'}
>>> syns = [wn.synset(i) for i in syns]
>>> syns[0]
Synset('brutal.s.04')
>>> syns[0].lemma_names()
[u'brutal']
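If only the head word of each name is needed, as in the expected output "brutal, benignant" from the question, it can probably be read straight from the name without loading WordNet. This is a sketch under the assumption that names follow the lemma.pos.nn pattern seen above; head_word is a hypothetical helper, not part of NLTK.

```python
def head_word(synset_name):
    """Return the lemma part of a name such as 'brutal.s.04'."""
    # Split from the right so a lemma that itself contains dots stays intact.
    lemma, _pos, _sense = synset_name.rsplit(".", 2)
    return lemma


names = {"brutal.s.04", "benignant.s.02"}
# Sort so the order does not depend on set iteration.
print(", ".join(sorted(head_word(n) for n in names)))
```

This yields the word the synset was named after; the other lemmas of the synset still require wn.synset(name).lemma_names().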