New named entity classes in spaCy

Khr*_*nko 2 python nlp named-entity-recognition spacy

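I need to train spaCy NER to recognize 2 new named-entity classes. All I have are files containing lists of items that should belong to the new classes.

For example: Rolling Stones, Muse, Arctic Monkeys are artists. Any ideas on how to do this?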

Dmy*_*hyi 6

This looks like a perfect use case for Matcher, or for PhraseMatcher if you care about performance.

import spacy

nlp = spacy.load('en')

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])


matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add(entity_key='1', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Rolling'}, {spacy.attrs.ORTH: 'Stones'}]], on_match=merge_phrases)
matcher.add(entity_key='2', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Muse'}]], on_match=merge_phrases)
matcher.add(entity_key='3', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Arctic'}, {spacy.attrs.ORTH: 'Monkeys'}]], on_match=merge_phrases)
doc = nlp(u'The Rolling Stones are an English rock band formed in London in 1962. The first settled line-up consisted of Brian Jones, Ian Stewart, Mick Jagger, Keith Richards, Bill Wyman and Charlie Watts')
matcher(doc)
for ent in doc.ents:
  print(ent)

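See the documentation for details. In my experience, with 400k entities in a Matcher, processing takes almost a second per document. PhraseMatcher is much faster, but a bit trickier to use. Note that this is a "strict" matcher: it will not match any entity it has not seen before.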
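For comparison, here is a minimal sketch of the PhraseMatcher route. It assumes spaCy v3, where patterns are passed as `matcher.add(key, docs)`, which differs from the older API used in the snippet above. The helper name `find_artists`, the `ARTIST` label, the blank English pipeline, and the case-insensitive `attr="LOWER"` setting are illustrative choices, not part of the original answer.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("en")

# Match on the lowercase form of each token, so "muse" and "Muse" both hit.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# In practice this list would be read from the file of known artists.
artists = ["Rolling Stones", "Muse", "Arctic Monkeys"]
matcher.add("ARTIST", [nlp.make_doc(name) for name in artists])


def find_artists(text):
    """Return (text, label) pairs for every listed artist found in `text`."""
    doc = nlp(text)
    spans = [Span(doc, start, end, label="ARTIST") for _, start, end in matcher(doc)]
    # filter_spans drops overlapping matches, which doc.ents does not allow.
    doc.ents = filter_spans(spans)
    return [(ent.text, ent.label_) for ent in doc.ents]


print(find_artists("The Rolling Stones are an English rock band formed in London in 1962."))
print(find_artists("Radiohead released an album."))  # not in the list, so nothing matches
```

As with the Matcher version, this only finds names that appear in the list. Generalizing to unseen artists would require training the statistical NER component on annotated examples.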