Convert NER SpaCy format to IOB format

eng*_*019 4 nlp named-entity-recognition spacy ner

I have data that is already labelled in SpaCy format. For example:

("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]})

But I want to try training it with some other NER model, such as BERT-NER, which requires IOB tagging instead. Is there any conversion code from the SpaCy data format to IOB?

Thanks!

Jin*_*ich 6

I am afraid you will have to write your own conversion, because IOB encoding depends on the tokenization that the pre-trained representation model (BERT, RoBERTa, or whatever pre-trained model you choose) uses.


The SpaCy format specifies the character span of the entity, i.e.

"Who is Shaka Khan?"[7:17]\n
Run Code Online (Sandbox Code Playgroud)\n\n

returns "Shaka Khan". You need to match that span against the tokens produced by the pre-trained model's tokenizer.


Here is how different models tokenize the example sentence when you use Huggingface's Transformers:

  • BERT: ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']
  • RoBERTa: ['Who', '_is', '_Sh', 'aka', '_Khan', '?']
  • XLNet: ['▁Who', '▁is', '▁Shak', 'a', '▁Khan', '?']
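
If you want to reproduce these tokenizations yourself, here is a minimal sketch using Transformers' AutoTokenizer (the checkpoint names below are the standard Huggingface ones, assumed for illustration; your versions may split slightly differently):

from transformers import AutoTokenizer

sentence = "Who is Shaka Khan?"
# Print how each pre-trained tokenizer splits the example sentence
for checkpoint in ("bert-base-cased", "roberta-base", "xlnet-base-cased"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    print(checkpoint, tokenizer.tokenize(sentence))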

Once you know how the tokenizer works, you can implement the conversion. Something like this could work for BERT tokenization:

entities = [(7, 17, "PERSON")]
tokenized = ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']

cur_start = 0   # running character offset in the detokenized string
state = "O"     # outside any entity
tags = []
for token in tokenized:
    # Deal with BERT's way of encoding spaces
    if token.startswith("##"):
        token = token[2:]
    else:
        token = " " + token

    cur_end = cur_start + len(token)
    if state == "O" and entities and cur_start <= entities[0][0] < cur_end:
        # This token contains the start of the next entity
        tags.append("B-" + entities[0][2])
        state = "I-" + entities[0][2]
    else:
        tags.append(state)
    # Close the entity when its end falls inside the current token
    if state.startswith("I-") and cur_start <= entities[0][1] < cur_end:
        state = "O"
        entities.pop(0)
    cur_start = cur_end
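
Applied to the example above, the loop should produce one tag per BERT token; the expected result is shown as a comment (verify it against your tokenizer's actual output):

print(list(zip(tokenized, tags)))
# [('Who', 'O'), ('is', 'O'), ('S', 'B-PERSON'), ('##hak', 'I-PERSON'),
#  ('##a', 'I-PERSON'), ('Khan', 'I-PERSON'), ('?', 'O')]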

Note that the snippet would break if one BERT token contained the end of one entity and the start of the next one. The tokenizer also does not distinguish how many spaces (or other whitespace characters) there were in the original string, which is another potential source of errors.



aab*_*aab 6

This is closely related to, and mostly copied from, /sf/answers/4144656421/; see also the notes in the comments there:

import spacy
from spacy.gold import biluo_tags_from_offsets
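# note: in spaCy v3 the spacy.gold module was removed; the equivalent
# helper there is spacy.training.offsets_to_biluo_tags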

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    # then convert L->I and U->B to have IOB tags for the tokens in the doc
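
For completeness, here is a minimal sketch of that last step, converting the BILUO tags returned above into IOB tags (the helper name biluo_to_iob is made up for illustration):

def biluo_to_iob(tags):
    # BILUO marks the last token of an entity with L- and single-token
    # entities with U-; IOB only has B- and I-, so remap those prefixes
    iob = []
    for tag in tags:
        if tag.startswith("L-"):
            iob.append("I-" + tag[2:])
        elif tag.startswith("U-"):
            iob.append("B-" + tag[2:])
        else:
            iob.append(tag)
    return iob

# e.g. ['O', 'O', 'B-PERSON', 'L-PERSON', 'O']
#   -> ['O', 'O', 'B-PERSON', 'I-PERSON', 'O']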