How does the spaCy lemmatizer work?

Lui*_*uez 11 python nlp wordnet lemmatization spacy

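For lemmatization, spaCy has lists of words (adjectives, adverbs, verbs...) and also lists of exceptions (adverbs_irreg...). For the regular words there is a set of rules.

Let's take the word "wider" as an example.

Since it is an adjective, the lemmatization rule should be taken from this list: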

ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
] 

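As far as I understand, the process goes like this:

1) Get the POS tag of the word, to know whether it is a noun, a verb...
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.

Now, how does it decide to use "er" -> "e" instead of "er" -> "" to get "wide" and not "wid"?

It can be tested here.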

alv*_*vas 12

Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

It starts by initializing 3 variables:

class Lemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules

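Now, looking at self.exc for English, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/__init__.py, where it loads files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer

Why doesn't spaCy just read a file?

Most probably because declaring the strings in code is faster than streaming them through I/O.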


Where do these index, exceptions and rules come from?

Looking at them closely, they all seem to come from the original Princeton WordNet https://wordnet.princeton.edu/man/wndb.5WN.html

Rules

Looking even closer, the rules in https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py are similar to the _morphy rules from nltk https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749

These rules originally come from the Morphy software https://wordnet.princeton.edu/man/morphy.7WN.html

Additionally, spaCy includes some punctuation rules that are not from Princeton Morphy:

PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]
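As a rough illustration of how suffix-style rules like these could be applied to a punctuation token, here is a minimal sketch. The helper name normalize_punct is hypothetical and not part of spaCy; it only mimics the endswith-and-replace pattern that the lemmatize function uses for its rules.

```python
# Rules copied from the listing above: curly quotes map to straight quotes.
PUNCT_RULES = [
    ["\u201c", "\""],
    ["\u201d", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"],
]


def normalize_punct(token, rules=PUNCT_RULES):
    """Hypothetical helper: apply the first matching suffix rule to a token."""
    for old, new in rules:
        if token.endswith(old):
            # Strip the matched suffix and append its replacement.
            return token[:len(token) - len(old)] + new
    # No rule matched, so the token is returned unchanged.
    return token


print(normalize_punct("\u201d"))
print(normalize_punct("!"))
```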

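Exceptions

As for the exceptions, they are stored in the *_irreg.py files in spaCy, and they look like they also come from the Princeton WordNet.

This is evident if we look at a mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc). If you download the wordnet package from nltk, you can see that it is the same list: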

alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc       cntlist.rev  data.noun  index.adv    index.verb  noun.exc
adv.exc       data.adj     data.verb  index.noun   lexnames    README
citation.bib  data.adv     index.adj  index.sense  LICENSE     verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc 
1490 adj.exc
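To make the file format concrete: each line of a WordNet exception file holds an inflected form followed by one or more base forms, separated by spaces. The sketch below parses lines in that shape into a dictionary similar to what a lemmatizer's exception table needs. The function name parse_exc_lines is hypothetical, and the sample lines are illustrative strings in the same format rather than lines copied from adj.exc.

```python
def parse_exc_lines(lines):
    """Map each inflected form to its list of base forms."""
    exceptions = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        inflected, bases = parts[0], parts[1:]
        exceptions.setdefault(inflected, []).extend(bases)
    return exceptions


# Illustrative lines in the "<inflected> <base> [<base> ...]" format.
sample = ["worse bad", "better good well", ""]
exc = parse_exc_lines(sample)
print(exc)
```

With a real file, the same function could be fed the lines of adj.exc from the nltk_data directory shown above.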

Index

If we look at the spaCy lemmatizer's index, we see that it also comes from WordNet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py and the redistributed copy of wordnet in nltk:

alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj 

  1 This software and database is being provided to you, the LICENSEE, by  
  2 Princeton University under the following license.  By obtaining, using  
  3 and/or copying this software and database, you agree that you have  
  4 read, understood, and will comply with these terms and conditions.:  
  5   
  6 Permission to use, copy, modify and distribute this software and  
  7 database and its documentation for any purpose and without fee or  
  8 royalty is hereby granted, provided that you agree to comply with  
  9 the following copyright notice and statements, including the disclaimer,  
  10 and that the same appear on ALL copies of the software, database and  
  11 documentation, including modifications that you make for internal  
  12 use or for distribution.  
  13   
  14 WordNet 3.0 Copyright 2006 by Princeton University.  All rights reserved.  
  15   
  16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
  17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
  18 IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
  19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
  20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
  21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
  22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
  23 OTHER RIGHTS.  
  24   
  25 The name of Princeton University or Princeton may not be used in  
  26 advertising or publicity pertaining to distribution of the software  
  27 and/or database.  Title to copyright in this software, database and  
  28 any associated documentation shall at all times remain with  
  29 Princeton University and LICENSEE agrees to preserve same.  
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"  
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"  
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"  
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"  
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex  
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base  
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part  
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part  
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 |  being born or beginning; "the nascent chicks"; "a nascent insurgency"   
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"  
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels  

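Given that the dictionary, exceptions and rules used by the spaCy lemmatizer come largely from Princeton WordNet and its Morphy software, we can move on to the actual implementation of how spaCy applies the rules using the index and the exceptions.

We go back to https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

The main action comes from a function rather than from the Lemmatizer class: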

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    # TODO: Is this correct? See discussion in Issue #435.
    #if string in index:
    #    forms.append(string)
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

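Why is the lemmatize method outside of the Lemmatizer class?

I'm not exactly sure, but perhaps it is to ensure that the lemmatization function can be called outside of a class instance. Given that @staticmethod and @classmethod exist, though, there may be other reasons why the function and the class have been decoupled.

Morphy vs spaCy

Comparing the spaCy lemmatize() function against the morphy() function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/, created more than a decade ago), the main steps of morphy() in Oliver Steele's Python port of WordNet's morphy are: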

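  1. Check the exception lists
  2. Apply the rules once to the input to get y1, y2, y3, etc.
  3. Return all forms that are in the database (and check the original too)
  4. If there are no matches, keep applying the rules until a match is found
  5. Return an empty list if nothing can be found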
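The five steps above can be sketched as follows. This is a toy restatement under stated assumptions, not NLTK's actual code: the function name morphy_sketch, the tiny vocabulary and the exception table are all made up for illustration, and a plain set stands in for the WordNet database.

```python
ADJECTIVE_RULES = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]


def morphy_sketch(form, vocabulary, exceptions, rules):
    """Toy version of the five morphy steps listed above."""

    def apply_rules(forms):
        # Replace every matching suffix in every candidate form.
        return [
            f[:len(f) - len(old)] + new
            for f in forms
            for old, new in rules
            if f.endswith(old)
        ]

    def in_vocabulary(forms):
        # Keep known forms only, preserving order and dropping duplicates.
        seen, kept = set(), []
        for f in forms:
            if f in vocabulary and f not in seen:
                seen.add(f)
                kept.append(f)
        return kept

    # 1. Check the exception list.
    if form in exceptions:
        return in_vocabulary([form] + exceptions[form])
    # 2. Apply the rules once to the input.
    forms = apply_rules([form])
    # 3. Return whatever is in the database, checking the original too.
    results = in_vocabulary([form] + forms)
    if results:
        return results
    # 4. Keep applying the rules until something matches.
    while forms:
        forms = apply_rules(forms)
        results = in_vocabulary(forms)
        if results:
            return results
    # 5. Nothing was found.
    return []


# Toy data, assumed for illustration only.
toy_vocabulary = {"wide", "bad"}
toy_exceptions = {"worse": ["bad"]}

print(morphy_sketch("wider", toy_vocabulary, toy_exceptions, ADJECTIVE_RULES))
print(morphy_sketch("worse", toy_vocabulary, toy_exceptions, ADJECTIVE_RULES))
print(morphy_sketch("alvations", toy_vocabulary, toy_exceptions, ADJECTIVE_RULES))
```

The loop in step 4 terminates because every rule removes more characters than it adds, so the candidate forms keep getting shorter until none matches a suffix.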

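As for spaCy, it is possibly still under development, given the TODO at line https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76

But the general process seems to be:

  1. Look for the word in the exception list and, if it is there, take its lemma(s) from the list.
  2. Apply the rules
  3. Keep the resulting forms that are in the index lists
  4. If steps 1-3 produce no lemma, fall back to the out-of-vocabulary (OOV) forms produced by the rules, and if there are none, append the original string to the lemma forms
  5. Return the lemma forms

In terms of OOV handling, spaCy returns the original string if no lemmatized form is found. In that respect, the nltk implementation of morphy does the same, e.g.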

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'

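Checking for the infinitive before lemmatization

Possibly another point of difference is how morphy and spaCy decide which POS to assign to the word. In that respect, spaCy puts some linguistic rules in Lemmatizer() to decide whether a word is already in its base form (is_base_form()) and skips lemmatization entirely if it is. This saves quite a bit of time when every word in a corpus is lemmatized and a large share of them are infinitives (already the lemma form).

But that is possible in spaCy because it lets the lemmatizer access the POS, which is closely tied to some morphological rules. For morphy, although it is possible to work out some morphology from the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitive.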
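The idea of skipping lemmatization for words that are already in base form can be sketched like this. It is a simplified, hypothetical check and not spaCy's is_base_form(): the function names, the lower-case POS labels and the feature names (VerbForm, Number, Degree, in the style of Universal Dependencies) are assumptions made for the example.

```python
def looks_like_base_form(univ_pos, morphology=None):
    """Hypothetical check: True when the tag already implies a base form."""
    morphology = morphology or {}
    if univ_pos == "verb" and morphology.get("VerbForm") == "inf":
        return True
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True
    if univ_pos == "adj" and morphology.get("Degree") == "pos":
        return True
    return False


def lemmatize_or_skip(word, univ_pos, morphology, fallback):
    """Skip the rule-based work when the word is already a base form."""
    if looks_like_base_form(univ_pos, morphology):
        return {word.lower()}
    # Otherwise hand the word to the slower rule-based lemmatizer.
    return fallback(word)


# The fallback here is a stand-in for a real rule-based lemmatizer.
print(lemmatize_or_skip("Run", "verb", {"VerbForm": "inf"}, lambda w: {w}))
print(lemmatize_or_skip("wider", "adj", {"Degree": "cmp"}, lambda w: {"wide"}))
```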

Generally, the 3 primary signals of morphological features need to be teased out of the POS tag:

  • gender

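Updated

spaCy did make changes to their lemmatizer after the initial answer (12 May). I think the purpose was to make lemmatization faster by avoiding the look-ups and rule processing.

So they pre-lemmatize words and keep them in a lookup hash table, which makes retrieval O(1) for the words that have been pre-lemmatized: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py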
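A lookup-based lemmatizer of this kind boils down to a dictionary access with a fallback. The sketch below uses a few made-up entries and a hypothetical function name, lookup_lemma; it only illustrates the O(1) retrieval idea and does not reproduce spaCy's lookup.py.

```python
# Illustrative entries only; a real pre-lemmatized table is far larger.
LOOKUP = {"wider": "wide", "worse": "bad", "are": "be"}


def lookup_lemma(word, table=LOOKUP):
    """Return the pre-computed lemma, or the word itself if it is unknown."""
    # dict.get is an average-case O(1) hash lookup.
    return table.get(word, word)


print(lookup_lemma("wider"))
print(lookup_lemma("alvations"))
```

The trade-off is that such a table ignores the POS tag, so a word with different lemmas for different parts of speech can only map to one of them.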

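Also, in an effort to unify the lemmatizers across languages, the lemmatizer is now located at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92

But the underlying lemmatization steps discussed above are still relevant to the current spaCy version (4d2d7d586608ddc0bcb2857fb3c2d0d4c151ebfc)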


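Epilogue

Now that we know it works with linguistic rules and all, I guess the other question is "are there any non-rule-based methods for lemmatization?"

But before even answering that question, "What exactly is a lemma?" might be the better question to ask.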


小智 8

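TLDR: spaCy checks whether the lemma it is trying to generate is in the known list of words or exceptions for that part of speech.

Long answer:

Check out the lemmatizer.py file, specifically the lemmatize function at the bottom.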

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

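For English adjectives, for example, it takes in the string we are evaluating, the index of known adjectives, the exceptions, and the rules, as you have referenced, from this directory (for the English model).

The first thing we do in lemmatize, after making the string lower case, is check whether the string is in our list of known exceptions, which includes lemma rules for words like "worse" -> "bad".

Then we go through our rules and apply each one to the string if it is applicable. For the word wider, we would apply the following rules: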

["er", ""],
["est", ""],
["er", "e"],
["est", "e"]

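And we would output the following forms: ["wid", "wide"].

Then we check whether each form is in our index of known adjectives. If it is, we append it to forms. Otherwise, we add it to oov_forms, which I'm guessing is short for out of vocabulary. wide is in the index, so it gets added to forms. wid gets added to oov_forms.

Lastly, we return a set of either the lemmas found, or any lemmas that matched rules but were not in our index, or just the word itself.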
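The walk-through above can be reproduced with toy data. The function body below repeats the lemmatize logic quoted earlier; the index and exception table are small stand-ins assumed for illustration, not spaCy's real tables.

```python
ADJECTIVE_RULES = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]


def lemmatize(string, index, exceptions, rules):
    # Same logic as the function quoted above.
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)       # known word, e.g. "wide"
            else:
                oov_forms.append(form)   # unknown candidate, e.g. "wid"
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)


# Toy data: assumptions for illustration only.
toy_index = {"wide", "bad"}
toy_exceptions = {"worse": ["bad"]}

print(lemmatize("wider", toy_index, toy_exceptions, ADJECTIVE_RULES))      # {'wide'}
print(lemmatize("worse", toy_index, toy_exceptions, ADJECTIVE_RULES))      # {'bad'}
print(lemmatize("alvations", toy_index, toy_exceptions, ADJECTIVE_RULES))  # {'alvations'}
```

For wider, "wid" is collected in oov_forms but discarded, because forms already contains "wide" by the time the fallback is considered.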

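The word-lemma link you posted above works for wider because wide is in the word index. Try something like "He is blandier than I." spaCy will mark blandier (a word I made up) as an adjective, but it is not in the index, so it will just return blandier as the lemma.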