WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK

Fly*_*ura 3 python nlp nltk wordnet lemmatization

I am lemmatizing the Ted Dataset transcript, and I noticed something strange: not all words are being lemmatized. For example,

selected -> select

which is right.

However, `involved !-> involve` and `horsing !-> horse` unless I explicitly pass the 'v' (verb) attribute.

In the Python terminal I get the right output, but not in my code:

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved','v')
u'involve'
>>> lem.lemmatize('horsing','v')
u'horse'

The relevant part of the code is this:

for l in LDA_Row[0].split('+'):
    w=str(l.split('*')[1])
    word=lmtzr.lemmatize(w)
    wordv=lmtzr.lemmatize(w,'v')
    print wordv, word
    # if word is not wordv:
    #   print word, wordv

The whole code is here.

What is the problem?

alv*_*vas 9

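The lemmatizer needs the correct POS tag to be accurate. If you call WordNetLemmatizer.lemmatize() with its default settings, the default tag is noun; see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39

To fix this, always POS-tag your data before lemmatizing, e.g.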

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...             lemma = word
...     else:
...             lemma = wnl.lemmatize(word, wntag)
...     print lemma
... 
This
be
a
foo
bar
sentence

Note that 'is' -> 'be', i.e.

>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
u'be'

To answer the question with the words from your examples:

>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print lemma
... 
These
sentence
involve
some
horse
around
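One caveat about the loops above: `tag[0].lower()` turns the Penn Treebank adjective tags (`JJ`, `JJR`, `JJS`) into `'j'`, which is not one of `'a'`, `'r'`, `'n'`, `'v'`, so adjectives are passed through unlemmatized, while the particle tag `RP` is treated as an adverb. A minimal sketch of an explicit mapping follows. The names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, not part of NLTK, and the lemmatizer is passed in as a callable so the sketch does not depend on the WordNet data being installed.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter.

    Returns None for tags with no WordNet counterpart
    (determiners, prepositions, particles, ...).
    """
    if tag.startswith('JJ'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('VB'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('NN'):
        return 'n'  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS (but not the particle tag RP)
    return None


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize. Words whose tag has no WordNet
    counterpart are returned unchanged.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        wn_pos = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, wn_pos) if wn_pos else word)
    return lemmas
```

With NLTK and its data installed, this would be called as `lemmatize_tagged(pos_tag(word_tokenize(sent)), wnl.lemmatize)`, using the same `wnl`, `pos_tag` and `word_tokenize` as in the sessions above.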

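Note that WordNetLemmatizer has a few quirks.

Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy.

For an out-of-the-box, off-the-shelf lemmatizer solution, you can take a look at https://github.com/alvations/pywsd, and at how I used some try-excepts to catch words that are not in WordNet; see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66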