WordNet lemmatization and POS tagging in Python

use*_*217 53 python nltk wordnet lemmatization

I want to use the WordNet lemmatizer in Python, and I have learned that the default POS tag is NOUN, so it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.

My question is: what is the best way to perform the above lemmatization accurately?

I did the POS tagging with nltk.pos_tag, but I am lost on how to convert the Treebank POS tags into WordNet-compatible POS tags. Please help.

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
tagged = nltk.pos_tag(tokens)

I get output tags like NN, JJ, VB, RB. How do I change these into WordNet-compatible tags?

Also, do I need to train nltk.pos_tag() on a tagged corpus, or can I use it directly on my data?

Suz*_*ana 73

First, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER:

>>> nltk.tag._POS_TAGGER
'taggers/maxent_treebank_pos_tagger/english.pickle'

Since it was trained on the Treebank corpus, it also uses the Treebank tag set.

The following function maps a Treebank tag to a WordNet part-of-speech name:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

You can then use the return value with the lemmatizer:

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('going', wordnet.VERB)
'go'

Check the return value before passing it to the lemmatizer, because an empty string will raise a KeyError.

  • Also remember the satellite adjectives =) `ADJ_SAT = 's'` http://wordnet.princeton.edu/wordnet/man/wngloss.7WN.html (13 upvotes)
  • @alvas Which Treebank tags should map to the ADJ_SAT WordNet tag? (3 upvotes)
  • The POS tag for 'it' in "I love it." is 'PRP'. The function returns an empty string, which the lemmatizer does not accept and throws a KeyError. What can be done in that case? (2 upvotes)

pg2*_*455 10

As in the source code of nltk.corpus.reader.wordnet (http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html):

#{ Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
#}
POS_LIST = [NOUN, VERB, ADJ, ADV]

  • Or more generally: from nltk.corpus import wordnet; print wordnet._FILEMAP; (3 upvotes)

Dee*_*pak 8

Conversion steps: document -> sentences -> tokens -> POS tags -> lemmas

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# example text
text = 'What can I say about this place. The staff of these restaurants is nice and the eggplant is not bad'

class Splitter(object):
    """
    split the document into sentences and tokenize each sentence
    """
    def __init__(self):
        self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self,text):
        """
        out : ['What', 'can', 'I', 'say', 'about', 'this', 'place', '.']
        """
        # split into single sentence
        sentences = self.splitter.tokenize(text)
        # tokenization in each sentences
        tokens = [self.tokenizer.tokenize(sent) for sent in sentences]
        return tokens


class LemmatizationWithPOSTagger(object):
    def __init__(self):
        pass
    def get_wordnet_pos(self,treebank_tag):
        """
        return the WordNet-compliant POS tag (a, n, r, v) for lemmatization
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # the lemmatizer's default POS is noun
            return wordnet.NOUN

    def pos_tag(self,tokens):
        # find the POS tag for each token: [('What', 'WP'), ('can', 'MD'), ('I', 'PRP') ....
        pos_tokens = [nltk.pos_tag(token) for token in tokens]

        # lemmatization using the POS tag
        # convert into a feature set of [('What', 'What', ['WP']), ('can', 'can', ['MD']), ... i.e. [original word, lemmatized word, POS tag]
        pos_tokens = [ [(word, lemmatizer.lemmatize(word,self.get_wordnet_pos(pos_tag)), [pos_tag]) for (word,pos_tag) in pos] for pos in pos_tokens]
        return pos_tokens

lemmatizer = WordNetLemmatizer()
splitter = Splitter()
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

#step 1 split document into sentence followed by tokenization
tokens = splitter.split(text)

#step 2 lemmatization using pos tagger 
lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)
print(lemma_pos_token)
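The nested comprehension in `pos_tag` above is dense. Unrolled into plain loops (with a stand-in lemmatizer and tag map so the sketch runs without NLTK data), it does the following:

```python
def get_wordnet_pos(treebank_tag):
    # same first-letter mapping as above, with noun ('n') as the default
    return {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}.get(treebank_tag[:1], 'n')

def lemmatize_stub(word, pos):
    # stand-in for WordNetLemmatizer.lemmatize; here it only lowercases
    return word.lower()

# one tagged sentence, as nltk.pos_tag would produce per sentence
pos_tokens = [[('What', 'WP'), ('can', 'MD'), ('I', 'PRP'), ('say', 'VB')]]

result = []
for sentence in pos_tokens:
    features = []
    for word, pos_tag in sentence:
        lemma = lemmatize_stub(word, get_wordnet_pos(pos_tag))
        features.append((word, lemma, [pos_tag]))  # [original, lemma, POS]
    result.append(features)
```

Swapping `lemmatize_stub` for the real `lemmatizer.lemmatize` gives exactly the comprehension's output.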


Shu*_*hia 8

You can create the map with a Python defaultdict and take advantage of the fact that the lemmatizer's default tag is noun.

from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = "Another way of achieving this task"
tokens = word_tokenize(text)
lmtzr = WordNetLemmatizer()

for token, tag in pos_tag(tokens):
    lemma = lmtzr.lemmatize(token, tag_map[tag[0]])
    print(token, "=>", lemma)
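The defaultdict fallback is the whole trick here, and it is independent of NLTK; a minimal sketch with the raw POS characters standing in for the wn constants:

```python
from collections import defaultdict

# 'n' (noun) is the value of wn.NOUN, and also the lemmatizer's default
tag_map = defaultdict(lambda: 'n')
tag_map['J'] = 'a'   # wn.ADJ
tag_map['V'] = 'v'   # wn.VERB
tag_map['R'] = 'r'   # wn.ADV

# Any unmapped first letter ('P' from 'PRP', 'M' from 'MD', ...) silently
# falls back to the noun default, so no if/else chain or guard is needed.
```

Looking a tag up by its first character, `tag_map[tag[0]]`, therefore never raises and never returns an empty string.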


Hah*_*pro 5

@Suzana_K's answer works, but I had some cases that caused the KeyError that @Clock Slave mentioned.

Convert Treebank tags to WordNet tags:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None # for easy if-statement 

Now we only pass the pos into the lemmatize function when we have a WordNet tag:

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
tagged = nltk.pos_tag(tokens)
for word, tag in tagged:
    wntag = get_wordnet_pos(tag)
    if wntag is None:  # do not pass a tag when there is none
        lemma = lemmatizer.lemmatize(word) 
    else:
        lemma = lemmatizer.lemmatize(word, pos=wntag) 
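Because lemmatize already defaults its pos to noun, the if/else above can be folded into a small helper. A sketch with a stand-in lemmatizer (so it runs without NLTK data) and the raw POS characters:

```python
def get_wordnet_pos(treebank_tag):
    # None signals "no WordNet tag for this Treebank tag"
    return {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}.get(treebank_tag[:1])

def lemmatize_with_fallback(lemmatize, word, treebank_tag):
    wntag = get_wordnet_pos(treebank_tag)
    # omit pos entirely when nothing maps, which equals passing pos='n'
    return lemmatize(word) if wntag is None else lemmatize(word, wntag)

# stand-in for WordNetLemmatizer().lemmatize, recording the POS it received
fake_lemmatize = lambda word, pos='n': (word, pos)
```

Passing the real `lemmatizer.lemmatize` in place of `fake_lemmatize` reproduces the loop above one word at a time.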