How to create an NER pipeline with multiple models in spaCy

Question

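I am trying to train new entities for spaCy NER. I tried adding my new entity to the existing spaCy 'en' model. However, this degraded the predictions for both the 'en' entities and my new entity.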

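So I created a blank model and trained entity recognition on it. This works well. However, it can only predict the entities I trained it on, not the regular spaCy entities.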

Say I trained "horse" as an ANIMAL entity.

For a given text

txt ='Did you know that George bought those horses for 10000 dollars?'

I expect the following entities to be recognized:

George - PERSON
horses - ANIMAL
10000 dollars - MONEY.

With my current setup, it only recognizes horses.

nlp = spacy.load('en')
hsnlp = spacy.load('models/spacy/animal/')
nlp.add_pipe(hsnlp.pipeline[-1][-1], 'hsner')

nlp.pipe_names

This gives

----------------------
['tagger', 'parser', 'ner', 'hsner']
----------------------

But when I try to run

doc = nlp(txt)  # <-- gives me a kernel error and stops working

Please let me know how to create a pipeline for NER efficiently. I am using spaCy 2.0.18.

Answer 1

aab*_*aab 6

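The main problem is how to load and combine pipeline components so that they use the same Vocab (nlp.vocab). A pipeline assumes that all of its components share the same vocab; otherwise you can run into errors related to the StringStore.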
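
To make the shared-vocab requirement concrete, here is a toy illustration. It is not spaCy's implementation: ToyStringStore is a hypothetical stand-in that only mimics the idea of storing strings as integer IDs in a per-pipeline table, and the md5-based ID is an arbitrary choice for the sketch.

```python
import hashlib


class ToyStringStore:
    """Hypothetical stand-in for a string table: maps integer IDs to strings."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # derive a stable integer ID from the string (arbitrary scheme for this sketch)
        key = int.from_bytes(hashlib.md5(text.encode("utf8")).digest()[:8], "big")
        self._by_id[key] = text
        return key

    def __getitem__(self, key):
        # an ID that was never added cannot be turned back into a string
        return self._by_id[key]


# two pipelines loaded separately each get their own table
strings_en = ToyStringStore()
strings_animal = ToyStringStore()

# a component trained with the second table refers to its label by ID
label_id = strings_animal.add("ANIMAL")

# resolving the ID works against the table it came from ...
resolved = strings_animal[label_id]

# ... but fails against the other pipeline's table
try:
    strings_en[label_id]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(resolved, lookup_failed)
```

A component that was moved into another pipeline without rebinding its vocab is in the position of label_id here: the ID it holds may mean nothing to the table that the rest of the pipeline consults.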

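You shouldn't try to combine pipeline components that were trained with different word vectors. As long as the vectors are the same, though, the question becomes how to load components from different models so that they share one vocab.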

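There is no way to do this with spacy.load(), so I think the simplest option is to initialize a new pipeline component with the required vocab and reload the existing component into it by temporarily serializing it.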

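To keep the working demo short and use easily accessible models, I'll show how to add the German NER component from de_core_news_sm to the English model en_core_web_sm, even though this is not something you would normally want to do: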

import spacy # tested with v2.2.3
from spacy.pipeline import EntityRecognizer

text = "Jane lives in Boston. Jan lives in Bremen."

# load the English and German models
nlp_en = spacy.load('en_core_web_sm')  # NER tags PERSON, GPE, ...
nlp_de = spacy.load('de_core_news_sm') # NER tags PER, LOC, ...

# the Vocab objects are not the same
assert nlp_en.vocab != nlp_de.vocab

# but the vectors are identical (because neither model has vectors)
assert nlp_en.vocab.vectors.to_bytes() == nlp_de.vocab.vectors.to_bytes()

# original English output
doc1 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc1.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Bremen', 'GPE')]

# original German output (the German model makes weird predictions for English text)
doc2 = nlp_de(text)
print([(ent.text, ent.label_) for ent in doc2.ents])
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]

# initialize a new NER component with the vocab from the English pipeline
ner_de = EntityRecognizer(nlp_en.vocab)

# reload the NER component from the German model by serializing
# without the vocab and deserializing using the new NER component
ner_de.from_bytes(nlp_de.get_pipe("ner").to_bytes(exclude=["vocab"]))

# add the German NER component to the end of the English pipeline
nlp_en.add_pipe(ner_de, name="ner_de")

# check that they have the same vocab
assert nlp_en.vocab == ner_de.vocab

# combined output (English NER runs first, German second)
doc3 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc3.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Jan lives', 'PER'), ('Bremen', 'GPE')]

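spaCy's NER components (EntityRuler and EntityRecognizer) are designed to preserve any existing entities, so the new component only adds Jan lives with the German NER tag PER and leaves all other entities as predicted by the English NER.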
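
Applied to the setup in the question, the same steps could be wrapped in a small helper. This is a minimal sketch, not a tested fix: add_ner_from is a hypothetical name, it assumes the NER component in the custom model is registered as "ner", and it assumes neither model uses word vectors the other lacks. The demo code was tested with v2.2.3, so the exclude argument may behave differently on the question's v2.0.18.

```python
def add_ner_from(nlp_target, nlp_source, name, source_pipe="ner"):
    """Copy an NER component from nlp_source into nlp_target (spaCy v2.x API).

    Hypothetical helper; source_pipe="ner" is an assumption about how the
    component was named when the source model was trained.
    """
    # imported here so the function can be defined without spaCy installed
    from spacy.pipeline import EntityRecognizer

    # new, empty NER component bound to the target pipeline's vocab
    ner = EntityRecognizer(nlp_target.vocab)
    # copy the serialized component across, leaving the source vocab behind
    ner.from_bytes(nlp_source.get_pipe(source_pipe).to_bytes(exclude=["vocab"]))
    # append after the target's own components, so its entities are set first
    nlp_target.add_pipe(ner, name=name)
    return ner


# Intended use with the models from the question (not run here):
#   import spacy
#   nlp = spacy.load('en')
#   hsnlp = spacy.load('models/spacy/animal/')
#   add_ner_from(nlp, hsnlp, name='hsner')
#   doc = nlp(txt)
```

The difference from the question's nlp.add_pipe(hsnlp.pipeline[-1][-1], 'hsner') is that the component is rebuilt on nlp.vocab before it is added, instead of being added while still tied to the vocab of hsnlp.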

You can use the options of add_pipe() to control where the component is inserted in the pipeline. To add the German NER before the default English NER:

nlp_en.add_pipe(ner_de, name="ner_de", before="ner")
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]
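
As a rough mental model of why the order matters, the sketch below treats each NER as producing its standalone spans and lets later components fill only tokens that are still free. This is a simplification and an assumption on my part: the real EntityRecognizer predicts with the existing entities as constraints, it does not merge two independent outputs. The function names are hypothetical, and the spans are copied from the standalone outputs in the demo above.

```python
def overlaps(a, b):
    """True if two (start, end, label) token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def run_in_order(*predictions):
    """Keep spans from earlier components; later ones only fill free tokens."""
    kept = []
    for spans in predictions:
        for span in spans:
            if not any(overlaps(span, other) for other in kept):
                kept.append(span)
    return sorted(kept)


tokens = "Jane lives in Boston . Jan lives in Bremen .".split()

# standalone outputs from the demo, written as (start, end, label) token offsets
english = [(0, 1, "PERSON"), (3, 4, "GPE"), (8, 9, "GPE")]
german = [(0, 2, "PER"), (3, 4, "LOC"), (5, 7, "PER"), (8, 9, "LOC")]


def show(spans):
    """Turn token-offset spans back into (text, label) pairs."""
    return [(" ".join(tokens[s:e]), label) for s, e, label in spans]


english_first = show(run_in_order(english, german))
german_first = show(run_in_order(german, english))
print(english_first)
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Jan lives', 'PER'), ('Bremen', 'GPE')]
print(german_first)
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]
```

Under this model, whichever component runs first wins every span that both would claim, which matches the two combined outputs reported in the answer.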

All of the add_pipe() options are described in the docs: https://spacy.io/api/language#add_pipe

You can save the extended pipeline to disk as a single model so that next time you can load it in one line with spacy.load():

nlp_en.to_disk("/path/to/model")
nlp_reloaded = spacy.load("/path/to/model")
print(nlp_reloaded.pipe_names) # ['tagger', 'parser', 'ner', 'ner_de']

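At least for me locally (spaCy 2.1.8), the behavior seems to be "don't run NER" if `is_nered` is true. In particular, I ran this code with two different serialized NER models, and the third set of entities was exactly the same as the first. (2 upvotes)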

Archived:	6 years, 11 months ago
Views:	326
Last recorded:	6 years, 11 months ago