How do I update my trained spaCy NER model with a new training dataset?

Sta*_*ark 4 python spacy spacy-3

I'm new to NLP and I've started learning how to train a custom NER model in spaCy.

TRAIN_DATA = [
          ('what is the price of polo?', {'entities': [(21, 25, 'Product')]}), 
          ('what is the price of ball?', {'entities': [(21, 25, 'Product')]}), 
          ('what is the price of jegging?', {'entities': [(21, 28, 'Product')]}), 
          ('what is the price of t-shirt?', {'entities': [(21, 28, 'Product')]}), 
          ('what is the price of jeans?', {'entities': [(21, 26, 'Product')]}), 
          ('what is the price of bat?', {'entities': [(21, 24, 'Product')]}), 
          ('what is the price of shirt?', {'entities': [(21, 26, 'Product')]}), 
          ('what is the price of bag?', {'entities': [(21, 24, 'Product')]}), 
          ('what is the price of cup?', {'entities': [(21, 24, 'Product')]}), 
          ('what is the price of jug?', {'entities': [(21, 24, 'Product')]}), 
          ('what is the price of plate?', {'entities': [(21, 26, 'Product')]}), 
          ('what is the price of glass?', {'entities': [(21, 26, 'Product')]}), 
          ('what is the price of moniter?', {'entities': [(21, 28, 'Product')]}), 
          ('what is the price of desktop?', {'entities': [(21, 28, 'Product')]}), 
          ('what is the price of bottle?', {'entities': [(21, 27, 'Product')]}), 
          ('what is the price of mouse?', {'entities': [(21, 26, 'Product')]}), 
          ('what is the price of keyboad?', {'entities': [(21, 28, 'Product')]}), 
          ('what is the price of chair?', {'entities': [(21, 26, 'Product')]}), 
          ('what is the price of table?', {'entities': [(21, 26, 'Product')]}), 
          ('what is the price of watch?', {'entities': [(21, 26, 'Product')]})
]

First, training a blank spaCy model:

import random
import spacy

def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
   

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp


start_training = train_spacy(TRAIN_DATA, 20)

Saving my trained spaCy model:

# Saving the trained model
start_training.to_disk("spacy_start_model")
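For reference, a quick way to sanity-check the saved model is to load it back and run it on one of the training sentences (a minimal sketch, assuming the spacy_start_model directory above):

import spacy

# load the saved pipeline back from disk and inspect its predictions
nlp_check = spacy.load("spacy_start_model")
doc = nlp_check("what is the price of polo?")
print([(ent.text, ent.label_) for ent in doc.ents])  # should ideally print [('polo', 'Product')]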

My question is: how do I update the saved model with new training data? The new training data is:

TRAIN_DATA_2 = [('Who is Chaka Khan?', {"entities": [(7, 17, 'PERSON')]}),
            ('I like London and Berlin.', {"entities": [(7, 13, 'LOC')]})]

Can anyone help me with this and offer some advice? Thanks in advance!

Emi*_*tti 5

As far as I know, you can retrain the model with new data examples, but instead of starting from a blank model, you can now start from your existing model.

To achieve this, first remove the following line from your train_spacy method and have the method receive the model as a parameter instead:

nlp = spacy.blank('en')  # create blank Language class

Then, to retrain the model, instead of loading a blank spaCy model and passing it to the training method, load your existing model with spacy.load and then call the training method (see the spaCy documentation on saving and loading for more details).

start_training = spacy.load("spacy_start_model") 

One last suggestion: in my experience, I have gotten better results by retraining a spaCy NER model from an existing pretrained model (e.g. en_core_web_md or en_core_web_lg) and adding my custom entities on top of it, rather than training from scratch with a blank spaCy model.
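For example, that workflow could look like the following (a minimal sketch: it assumes en_core_web_md has already been downloaded with python -m spacy download en_core_web_md, it reuses the updated train_spacy(data, iterations, nlp) function from the next section, and the spacy_product_model directory name is just illustrative):

import spacy

# start from a pretrained pipeline instead of spacy.blank('en')
nlp = spacy.load("en_core_web_md")

# train the custom 'Product' label on top of the pretrained pipeline
start_training = train_spacy(TRAIN_DATA, 20, nlp)

# save the result under its own directory
start_training.to_disk("spacy_product_model")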

Putting it all together

  1. Update the method
def train_spacy(data, iterations, nlp):  # <-- Add model as nlp parameter
    TRAIN_DATA = data
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')
   

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # note: in spaCy 2.x, begin_training() re-initialises the component weights;
        # when updating an already-trained model you may prefer nlp.resume_training()
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp

nlp = spacy.blank('en')  # create blank Language class
start_training = train_spacy(TRAIN_DATA, 20, nlp)
  2. Retrain your model
TRAIN_DATA_2 = [('Who is Chaka Khan?', {"entities": [(7, 17, 'PERSON')]}),
            ('I like London and Berlin.', {"entities": [(7, 13, 'LOC')]})]

nlp = spacy.load("spacy_start_model")  # <-- Now your base model is your custom model
start_training = train_spacy(TRAIN_DATA_2, 20, nlp)
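Note that the training loop above uses the spaCy 2.x API. Since the question is also tagged spacy-3: in spaCy 3.x, nlp.update no longer accepts raw (text, annotations) pairs, so the same update step has to go through Example objects. A minimal sketch of that variant (assuming spaCy 3.x and the spacy_start_model directory from above):

import random
import spacy
from spacy.training import Example

nlp = spacy.load("spacy_start_model")
ner = nlp.get_pipe("ner")

# make sure the new labels are known to the NER component
for _, annotations in TRAIN_DATA_2:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

# resume_training() keeps the existing weights instead of re-initialising them
optimizer = nlp.resume_training()
for itn in range(20):
    random.shuffle(TRAIN_DATA_2)
    losses = {}
    for text, annotations in TRAIN_DATA_2:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], drop=0.2, sgd=optimizer, losses=losses)
    print(losses)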

I hope this works for you!