NER 训练循环中的损失在 spacy 中并未减少

Abi*_*aul 5 python nlp deep-learning spacy

我正在尝试训练一种新的实体类型“HE INST”——识别大学。这是唯一的新标签。我有一份很长的原始文本文档。我对其运行 NER 并将实体保存到 TRAIN DATA,然后将新的实体标签添加到 TRAIN_DATA(我在重叠的地方进行了替换)。

训练循环的损失值恒定(所有 15 个文本约为 4000),单个数据的损失值约为 300。为什么会发生这种情况,如何正确训练模型。我有大约 18 个文本,其中有 40 个带注释的新实体。即使在所有迭代之后,模型仍然无法正确预测输出。

我没有对剧本做太多改动。刚刚添加了 en_core_web_lg、新标签和我的 TRAIN_DATA

我正在尝试从简历(CV)数据中标记机构:

这将是我在 TRAIN_DATA 中的文本之一:(太长的文本了)我有大约 18 个这样的文本连接起来形成 TRAIN_DATA

[("To perform better in my work each day. To increase my knowledge. To bring out my best by hardworking and improving my skills. To serve my parents and my family. To contribute my skills to my country. Marital ; Single Status Nationality \xe2\x80\x94: Indian Known . Parr . English, Malayalam, Hindi, Tamil Languages Hobby Playing cricket and football, Listening to music, Movies, Games. Father's ; V.N. Balappan Nair Name Mother's ; Saraswathy B Nair Name Believers Church Caarmel Engineering College R-Perunad Btech Electronics and communication engineering 6.09(Upto S6) 2015 - 2019 Marthoma Senior Secondary School Kozhencherry All India Senior School Certificate Examination 75% 2014 - 2015 Marthoma Senior Secondary School Kozhencherry Secondary School Examination 8.2 2012 - 2013 s@ INTERESTS Electronics, Sports s@ PERSONAL STRENGTHS Hardworking Loyal Good Team Spirit Good in mathematics ees IAA eM LANL NUL e (2 Problem Solving Skills rg DUS \\ TRAININGS completed the Vocational Industrial Training on Long Distance Communication Systems conducted by Southern Telecom Region, Bharat Sanchar Nigam Limited. Completed the internship training in Power Electronics Group(PEG), Tool Room, Fabrication Shop, Transform Winding, Electro Plating, Security And Surveillance Group(SSG), Special Products Group(SPG), Search And Rescue Beacon(SRB), Intelligent Tracking and Communication Project and Technology Development Center of Keltron Equipment Complex, Thiruvananthapuram. PROJECTS Final Year Project: Life Detection Using Quadcopter This project is useful at the time of natural calamities like flood earthquake etc... And can also be used in military applications as this device detects life signals using a PIR sensor and a thermal sensor. The components used in this are: PIR sensor, Thermal sensor, Arduino Nano, BEC, ESC, Quadcopter. Design project: Wireless Power Bank Wireless Power Bank enables us to charge our phone wordlessly. It can charge a device which is kept 10m(maximum) away from the adaptor without any obstacles in between. It uses the IR technology for power transmission. ACHIEVEMENTS & AWARDS Participated in Pecardio Debugging Conducted as a part of NAKSHATRA 2019, The Annual National Level Techno Cultural Fest held at Saingits College of Engineering, kottayam. Volunteered in Alexa One day workshop on Artificial intelligence. Completed a period of two year tenue with a total of 240 hours in the National Service Scheme activities and has attended NSS Annual Special Camp. Participant in Cricket and football at the Annual Sports Meets. DECLARATION do here by confirm that the information given in this form is true to the best of my knowledge and belief.", {'entities': [(29, 37, 'DATE'), (210, 223, 'ORG'), (241, 247, 'NORP'), (256, 260, 'PERSON'), (263, 270, 'LANGUAGE'), (272, 281, 'PERSON'), (283, 288, 'PERSON'), (290, 295, 'NORP'), (362, 375, 'EVENT'), (388, 401, 'PERSON'), (402, 420, 'PERSON'), (423, 445, 'PERSON'), (446, 490, 'HE INST'), (563, 574, 'DATE'), (575, 620, 'ORG'), (625, 668, 'ORG'), (669, 672, 'PERCENT'), (673, 684, 'DATE'), (685, 717, 'ORG'), (764, 775, 'DATE'), (779, 800, 'ORG'), (890, 893, 'ORG'), (909, 910, 'CARDINAL'), (963, 997, 'ORG'), (1001, 1036, 'ORG'), (1050, 1073, 'ORG'), (1075, 1103, 'ORG'), (1142, 1169, 'ORG'), (1172, 1181, 'ORG'), (1183, 1199, 'ORG'), (1201, 1218, 'ORG'), (1220, 1235, 'ORG'), (1275, 1301, 'ORG'), (1304, 1332, 'ORG'), (1335, 1355, 'ORG'), (1360, 1415, 'ORG'), (1419, 1444, 'ORG'), (1446, 1464, 'LOC'), (1475, 1494, 'EVENT'), (1797, 1809, 'GPE'), (1811, 1814, 'GPE'), (1816, 1819, 'ORG'), (1821, 1831, 'ORG'), (1849, 1888, 'ORG'), (1969, 1980, 'CARDINAL'), (2050, 2052, 'ORG'), (2088, 2122, 'ORG'), (2126, 2154, 'ORG'), (2168, 2182, 'EVENT'), (2188, 2194, 'DATE'), (2239, 2270, 'HE INST'), (2297, 2302, 'GPE'), (2303, 2310, 'DATE'), (2358, 2369, 'DATE'), (2370, 2378, 'DATE'), (2401, 2410, 'TIME'), (2414, 2441, 'ORG'), (2470, 2493, 'ORG'), (2534, 2557, 'EVENT')]})]
Run Code Online (Sandbox Code Playgroud)

脚本如下:(注意:- eval 函数用于在从文本文件中将 TRAIN_DATA 作为字符串读取后将其解析为列表 -----您很可能知道这一点,但以防万一)

from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
import en_core_web_lg
from spacy.util import minibatch, compounding


# new entity label
LABEL = "HE INST"

with open('train_dump-backup.txt', 'r') as i_file:
    t_data = i_file.read()
TRAIN_DATA=eval(t_data)

@plac.annotations(
    model=("en_core_web_lg", "option", "m", str),
    new_model_name=("NLP_INST", "option", "nm", str),
    output_dir=("/home/drbinu/Downloads/NLP_INST", "option", "o", Path),
    n_iter=("30", "option", "n", int),
)

def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer
    # Adding extraneous labels shouldn't mess anything up
    ner.add_label("VEGETABLE")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "B.Tech from Believers Church Caarmel Engineering College CGPA of 8.9"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)


if __name__ == "__main__":
    plac.call(main)
Run Code Online (Sandbox Code Playgroud)

aik*_*er2 4

损失似乎正在增加,因为管道组件在更新步骤中增加了损失:

https://github.com/explosion/spaCy/blob/ae4af52ce7dd9dda0eb0f1b8eeb0cba7d20facdf/spacy/pipeline/pipes.pyx#L989

在每个 epoch 开始时,您可能需要对总累积损失进行快照;在纪元结束时,您可以计算观察到的数据的平均损失。