Updating spaCy's built-in NER model instead of overwriting it

sod*_*zs1 5 python nlp python-3.x spacy spacy-3

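I am using spaCy's built-in model, en_core_web_lg, and want to train it with my custom entities. While doing so, I am facing two issues:

  1. The new training overwrites what the model had already learned, so it no longer recognizes the other entities. For example, before training it recognizes PERSON and ORG, but after training it recognizes neither.

  2. During training, it gives me the following warning: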

UserWarning: [W030] Some entities could not be aligned in the text "('I work in Google.',)" with entities "[(9, 15, 'ORG')]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.

Here is my entire code:

import spacy
import random
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy.training.example import Example
sentence = ""
body1 = "James work in Facebook and love to have tuna fishes in the breafast."
nlp_lg = spacy.load("en_core_web_lg")
print(nlp_lg.pipe_names)
doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


train = [
    ('I had tuna fish in breakfast', {'entities': [(6,14,'FOOD')]}),
    ('I love prawns the most', {'entities': [(6,12,'FOOD')]}),
    ('fish is the rich source of protein', {'entities': [(0,4,'FOOD')]}),
    ('I work in Google.', {'entities': [(9,15,'ORG')]})
    ]


ner = nlp_lg.get_pipe("ner")

for _, annotations in train:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

disable_pipes = [pipe for pipe in nlp_lg.pipe_names if pipe != 'ner']

with nlp_lg.disable_pipes(*disable_pipes):
    optimizer = nlp_lg.resume_training()
    for interation in range(30):
        random.shuffle(train)
        losses = {}

        batches = minibatch(train, size=compounding(1.0,4.0,1.001))
        for batch in batches:
            text, annotation = zip(*batch)
            doc1 = nlp_lg.make_doc(str(text))
            example = Example.from_dict(doc1, annotations)
            nlp_lg.update(
                [example],
                drop = 0.5,
                losses = losses,
                sgd = optimizer
                )
            print("Losses",losses)

doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


Expected output:

James 0 5 PERSON
Facebook 14 22 ORG
tuna fishes 40 51 FOOD


Currently, no entities are recognized at all.

Please let me know where I am going wrong. Thanks!

pol*_*m23 1

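The "overwriting" you describe is known as "catastrophic forgetting", and there is a post about it on the spaCy blog. There is no perfect way to work around it, but we recently worked on a fix here.

Regarding your alignment error...

"('I work in Google.',)" with entities "[(9, 15, 'ORG')]"

Your character offsets are off.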
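One commonly suggested way to reduce catastrophic forgetting is to mix the original model's own predictions into the new training data, so the update keeps seeing PERSON and ORG examples alongside FOOD. The following is only a minimal sketch of that idea, not the fix referred to above. `add_rehearsal_examples` and `predict_entities` are hypothetical names introduced for illustration; `predict_entities` is assumed to wrap the unmodified pipeline, and the data uses the same `(text, {"entities": [...]})` shape as the `train` list in the question.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]                    # (start_char, end_char, label)
TrainItem = Tuple[str, Dict[str, List[Span]]]  # same shape as `train` in the question


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def add_rehearsal_examples(
    new_data: List[TrainItem],
    raw_texts: List[str],
    predict_entities: Callable[[str], List[Span]],
) -> List[TrainItem]:
    """Mix new gold annotations with the original model's predictions.

    `predict_entities` is a hypothetical wrapper around the pipeline as it
    was *before* any update, for example:
        lambda text: [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]
    """
    mixed: List[TrainItem] = []

    # Keep every gold span, and add the old model's predictions for the same
    # text wherever they do not collide with a gold span.
    for text, annotations in new_data:
        gold = list(annotations["entities"])
        extra = [
            pred for pred in predict_entities(text)
            if not any(overlaps(pred, g) for g in gold)
        ]
        mixed.append((text, {"entities": sorted(gold + extra)}))

    # Add extra texts that are labelled only by the old model, so the
    # previously learned labels stay represented in every epoch.
    for text in raw_texts:
        mixed.append((text, {"entities": sorted(predict_entities(text))}))

    return mixed
```

The predictions have to be collected before the first call to `nlp.update`, otherwise they would come from the already-modified model. How many extra texts to add is something to tune on your own data.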

"I work in Google."[9:15]
# => " Googl"
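It may also be worth noting that the text quoted in the warning, "('I work in Google.',)", looks like the string form of a one-element tuple rather than the sentence itself. That suggests the text reaching `make_doc` went through `str()` on the tuple produced by `zip(*batch)`. If so, every offset would be shifted by the leading `('`, and even correct offsets would no longer line up. A small pure-Python sketch of that effect, using the corrected offsets (10, 16) as an assumption:

```python
# One training example with offsets that are correct for the raw sentence.
batch = [("I work in Google.", {"entities": [(10, 16, "ORG")]})]

# zip(*batch) yields tuples, even when the batch holds a single example.
texts, annotations = zip(*batch)

# str() on a tuple gives its repr, including parentheses, quotes and comma.
as_string = str(texts)
start, end, _label = annotations[0]["entities"][0]

sliced_from_repr = as_string[start:end]   # slice of the tuple's repr
sliced_from_text = texts[0][start:end]    # slice of the actual sentence

print(repr(as_string))         # "('I work in Google.',)"
print(repr(sliced_from_repr))  # 'n Goog'
print(repr(sliced_from_text))  # 'Google'
```

Building one `Example` per `(text, annotation)` pair in the batch, from the raw string and that pair's own annotation dict, would avoid the problem.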

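Perhaps they are off by a constant value, in which case you could fix this by adding 1 to everything, but you will need to look at your data to figure that out.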
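To look at the data, it can help to slice every annotated span out of its text and flag the ones that look wrong. The sketch below does that for the `train` list from the question. `is_clean` and `report` are hypothetical helpers, and the check is only a character-level heuristic (no leading or trailing whitespace, no word cut in half); it does not use spaCy's tokenizer, so `offsets_to_biluo_tags`, as suggested in the warning, remains the authoritative check.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]
TrainItem = Tuple[str, Dict[str, List[Span]]]

# The training data exactly as given in the question.
train: List[TrainItem] = [
    ("I had tuna fish in breakfast", {"entities": [(6, 14, "FOOD")]}),
    ("I love prawns the most", {"entities": [(6, 12, "FOOD")]}),
    ("fish is the rich source of protein", {"entities": [(0, 4, "FOOD")]}),
    ("I work in Google.", {"entities": [(9, 15, "ORG")]}),
]


def is_clean(text: str, start: int, end: int) -> bool:
    """Heuristic: the span has no outer whitespace and does not cut a word."""
    span = text[start:end]
    if not span or span != span.strip():
        return False
    before_ok = start == 0 or not text[start - 1].isalnum()
    after_ok = end == len(text) or not text[end].isalnum()
    return before_ok and after_ok


def report(data: List[TrainItem]) -> List[Tuple[str, str, bool]]:
    """Return (sliced text, label, looks_ok) for every annotated span."""
    rows = []
    for text, annotations in data:
        for start, end, label in annotations["entities"]:
            rows.append((text[start:end], label, is_clean(text, start, end)))
    return rows


for span, label, ok in report(train):
    print(f"{span!r:12} {label:5} {'ok' if ok else 'CHECK'}")
```

On this data only the "fish" span passes. The "tuna fish" span starts in the right place but ends one character early, while the "prawns" and "Google" spans are shifted left by one, so adding 1 to everything would not fix all of them.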