Sta*_*ark 4 python spacy spacy-3
I am new to NLP and have started learning how to train a custom NER model in spaCy.
TRAIN_DATA = [
('what is the price of polo?', {'entities': [(21, 25, 'Product')]}),
('what is the price of ball?', {'entities': [(21, 25, 'Product')]}),
('what is the price of jegging?', {'entities': [(21, 28, 'Product')]}),
('what is the price of t-shirt?', {'entities': [(21, 28, 'Product')]}),
('what is the price of jeans?', {'entities': [(21, 26, 'Product')]}),
('what is the price of bat?', {'entities': [(21, 24, 'Product')]}),
('what is the price of shirt?', {'entities': [(21, 26, 'Product')]}),
('what is the price of bag?', {'entities': [(21, 24, 'Product')]}),
('what is the price of cup?', {'entities': [(21, 24, 'Product')]}),
('what is the price of jug?', {'entities': [(21, 24, 'Product')]}),
('what is the price of plate?', {'entities': [(21, 26, 'Product')]}),
('what is the price of glass?', {'entities': [(21, 26, 'Product')]}),
('what is the price of moniter?', {'entities': [(21, 28, 'Product')]}),
('what is the price of desktop?', {'entities': [(21, 28, 'Product')]}),
('what is the price of bottle?', {'entities': [(21, 27, 'Product')]}),
('what is the price of mouse?', {'entities': [(21, 26, 'Product')]}),
('what is the price of keyboad?', {'entities': [(21, 28, 'Product')]}),
('what is the price of chair?', {'entities': [(21, 26, 'Product')]}),
('what is the price of table?', {'entities': [(21, 26, 'Product')]}),
('what is the price of watch?', {'entities': [(21, 26, 'Product')]})
]
First, I train a blank spaCy model:
import random

import spacy


def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp


start_training = train_spacy(TRAIN_DATA, 20)
Saving my trained spaCy model:
# Saving the trained model
start_training.to_disk("spacy_start_model")
My question is: how can I update the saved model with new training data? The new training data:
TRAIN_DATA_2 = [('Who is Chaka Khan?', {"entities": [(7, 17, 'PERSON')]}),
('I like London and Berlin.', {"entities": [(7, 13, 'LOC')]})]
Can anyone help me with this and offer suggestions? Thanks in advance!
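As far as I know, you can retrain the model on the new examples, but starting from your existing model rather than from a blank one.
To do this, first remove the following line from your train_spacy function and have the function receive the model as a parameter instead: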
nlp = spacy.blank('en') # create blank Language class
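Then, to retrain the model, instead of creating a blank spaCy model and passing it to the training function, load your existing model with spacy.load and call the training function with it (see the spaCy documentation on saving and loading for more).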
start_training = spacy.load("spacy_start_model")
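One last suggestion: in my experience, retraining the spaCy NER from an existing model (for example en_core_web_md or en_core_web_lg) and adding my custom entities has given me better results than training from scratch on a blank spaCy model.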
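When starting from an existing model, it can help to see which labels in the new data the model does not know yet. The following is only a plain-Python sketch: `find_new_labels` is a hypothetical helper, and `known_labels` is a stand-in for whatever the NER component reports (in spaCy this would typically come from `nlp.get_pipe('ner').labels`).

```python
def find_new_labels(train_data, known_labels):
    """Return the labels used in train_data that are not in known_labels, sorted."""
    seen = set()
    for _text, annotations in train_data:
        for _start, _end, label in annotations.get("entities", []):
            seen.add(label)
    return sorted(seen - set(known_labels))


# Hypothetical stand-in for the labels an existing NER component already has.
known_labels = ("Product",)

TRAIN_DATA_2 = [
    ("Who is Chaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC")]}),
]

print(find_new_labels(TRAIN_DATA_2, known_labels))  # ['LOC', 'PERSON']
```

Each label returned here is one that would need `ner.add_label(...)` before training.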
All together:
import random

import spacy


# Note: this uses the spaCy v2 training API. In spaCy v3, nlp.add_pipe takes
# the component name ('ner') and nlp.update expects Example objects.
def train_spacy(data, iterations, nlp):  # <-- Add model as nlp parameter
    TRAIN_DATA = data
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
        is_new_ner = True
    else:
        ner = nlp.get_pipe('ner')
        is_new_ner = False
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # begin_training() re-initializes the weights, so only call it for a
        # freshly added NER; an already trained NER keeps its weights with
        # resume_training()
        if is_new_ner:
            optimizer = nlp.begin_training()
        else:
            optimizer = nlp.resume_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp


nlp = spacy.blank('en')  # create blank Language class
start_training = train_spacy(TRAIN_DATA, 20, nlp)
start_training.to_disk("spacy_start_model")  # save it so it can be loaded later
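Before training on any new examples, it may be worth checking that each (start, end) pair selects the text you intend, since the offsets are character positions with an exclusive end. A minimal sketch in plain Python; `entity_texts` is a hypothetical helper, and the sample rows are copied from the data in the question:

```python
def entity_texts(train_data):
    """Return (label, substring) pairs selected by each entity span."""
    pairs = []
    for text, annotations in train_data:
        for start, end, label in annotations.get("entities", []):
            # Reject spans that fall outside the text or are empty.
            if not (0 <= start < end <= len(text)):
                raise ValueError(f"span ({start}, {end}) is out of range for {text!r}")
            pairs.append((label, text[start:end]))
    return pairs


sample = [
    ("what is the price of polo?", {"entities": [(21, 25, "Product")]}),
    ("Who is Chaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC")]}),
]

for label, span_text in entity_texts(sample):
    print(label, repr(span_text))
# Product 'polo'
# PERSON 'Chaka Khan'
# LOC 'London'
```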
TRAIN_DATA_2 = [('Who is Chaka Khan?', {"entities": [(7, 17, 'PERSON')]}),
('I like London and Berlin.', {"entities": [(7, 13, 'LOC')]})]
nlp = spacy.load("spacy_start_model") # <-- Now your base model is your custom model
start_training = train_spacy(TRAIN_DATA_2, 20, nlp)
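One risk with continuing training on only the new examples is that the model may get worse at the labels it learned earlier (often called catastrophic forgetting). A common mitigation, if you still have the original examples, is to mix some of them back in with the new ones. The sketch below is plain Python under that assumption; `mix_training_data`, `old_fraction` and `seed` are hypothetical names, and the right proportion is something you would need to tune.

```python
import random


def mix_training_data(new_data, old_data, old_fraction=0.5, seed=0):
    """Return new_data plus a random sample of old_data, shuffled.

    old_fraction is the share of old_data to include (0.0 to 1.0).
    """
    if not 0.0 <= old_fraction <= 1.0:
        raise ValueError("old_fraction must be between 0.0 and 1.0")
    rng = random.Random(seed)  # local RNG so the result is reproducible
    n_old = round(len(old_data) * old_fraction)
    mixed = list(new_data) + rng.sample(list(old_data), n_old)
    rng.shuffle(mixed)
    return mixed


old_examples = [
    ("what is the price of polo?", {"entities": [(21, 25, "Product")]}),
    ("what is the price of ball?", {"entities": [(21, 25, "Product")]}),
    ("what is the price of bat?", {"entities": [(21, 24, "Product")]}),
    ("what is the price of cup?", {"entities": [(21, 24, "Product")]}),
]
new_examples = [
    ("Who is Chaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC")]}),
]

mixed = mix_training_data(new_examples, old_examples, old_fraction=0.5)
print(len(mixed))  # 4: two new examples plus two sampled old ones
```

The mixed list could then be passed as the `data` argument of `train_spacy` in place of `TRAIN_DATA_2`.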
I hope this works for you!