Extracting names from a text file using spaCy

Sli*_*ind 5 nlp nltk data-extraction python-3.x spacy

I have a text file that contains lines like the following:

Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST

The patient was referred by Dr. Jacob Austin.  

Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST

Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST

The patient was referred by
Dr. Jayden Green Olivia.  

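I want to extract all the names using spaCy. I have tried spaCy's part-of-speech tagging and entity recognition, but without success. How can this be done? Any help would be appreciated.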

This is the code I am using:

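import spacy

nlp = spacy.load('en')
# The string is split across two literals so that it stays valid Python
document_string = ("Electronically signed by stupid: Dr. John Douglas, M.D.; "
                   "Jun 13 2018 11:13AM CST")
doc = nlp(document_string)
# Print every entity the model found, together with its label
for ent in doc.ents:
    print(ent, ent.label_)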

use*_*737 8

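The problem with model accuracy

The problem with every model is that none of them is 100% accurate, and even using a larger model does not help with recognizing dates. The published accuracy values (F-score, precision, recall) for the NER models are all around 86%.
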
document_string = """
Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST
The patient was referred by Dr. Jacob Austin.
Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST
Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST
The patient was referred by
Dr. Jayden Green Olivia.
"""

With the small model, two date items are labelled as PERSON:

import spacy

nlp = spacy.load('en')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
#  Jun 26,
#  Jacob Austin,
#  Robert Clowson,
#  John Douglas,
#  Jun 16 2017,
#  Jayden Green Olivia]

With the larger model en_core_web_md, the results are even worse in terms of precision, since there are three misclassified entities:

nlp = spacy.load('en_core_web_md')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
#  Jun 26,
#  Jacob Austin,
#  Robert Clowson,
#  Janury,
#  John Douglas,
#  Jun 16 2017,
#  Jayden Green Olivia]

I also tried other models (xx_ent_wiki_sm, en_core_web_md), and they did not bring any improvement either.
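
To make the precision comparison concrete, the two PERSON lists shown above can be scored against the five names that actually occur in the sample. This is only a small sketch: the gold list was read off the sample text by hand, and the helper name precision_recall is hypothetical.

```python
# Names that really occur in the sample text (assumption: listed by hand).
gold = {"Wes Scott", "Jacob Austin", "Robert Clowson",
        "John Douglas", "Jayden Green Olivia"}

# PERSON entities reported above for each model.
predicted = {
    "en": ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
           "John Douglas", "Jun 16 2017", "Jayden Green Olivia"],
    "en_core_web_md": ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
                       "Janury", "John Douglas", "Jun 16 2017",
                       "Jayden Green Olivia"],
}

def precision_recall(pred, gold):
    """Return (precision, recall) of a predicted entity list against gold."""
    true_pos = sum(1 for p in pred if p in gold)
    return true_pos / len(pred), true_pos / len(gold)

for model_name, pred in predicted.items():
    p, r = precision_recall(pred, gold)
    print(f"{model_name}: precision={p:.3f} recall={r:.3f}")
# en: precision=0.714 recall=1.000
# en_core_web_md: precision=0.625 recall=1.000
```

Both models find every real name, so recall is the same; the difference is entirely in the extra date fragments that lower precision.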

What about using rules to improve accuracy?

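In this small example, not only does the document appear to have a clear structure, but the misclassified entities are all dates. So why not combine the initial model with a rule-based component?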

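The good news is that in spaCy:

"You can combine statistical and rule-based components in a variety of ways. Rule-based components can be used to improve the accuracy of statistical models"

(from https://spacy.io/usage/rule-based-matching#models-rules)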

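So, by following that example and using the dateparser library (a parser for human-readable dates), I put together a rule-based component that works quite well on this example:
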
from spacy.tokens import Span
import dateparser

TITLES = ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.")

def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            # Non-person entities are passed through unchanged
            new_ents.append(ent)
        elif ent.start != 0 and doc[ent.start - 1].text in TITLES:
            # If the person is preceded by a title, include the title in the entity
            new_ents.append(Span(doc, ent.start - 1, ent.end, label=ent.label))
        elif dateparser.parse(ent.text) is None:
            # If the entity can be parsed as a date, it's not a person; keep it otherwise
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# Add the component after the named entity recognizer
# (spaCy v2 API; spaCy v3 expects a registered component name instead of the function)
# nlp.remove_pipe('expand_person_entities')
nlp.add_pipe(expand_person_entities, after='ner')

doc = nlp(document_string)
[(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'PERSON']
# Out:
# [('Wes Scott', 'PERSON'),
#  ('Dr. Jacob Austin', 'PERSON'),
#  ('Robert Clowson', 'PERSON'),
#  ('Dr. John Douglas', 'PERSON'),
#  ('Dr. Jayden Green Olivia', 'PERSON')]
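
Since the sample lines follow two fixed patterns ("Electronically signed ..." and "referred by ..."), a plain regular-expression pass can serve as a cross-check on whatever the statistical model returns. The sketch below is not spaCy and is not part of the component above; it assumes every name consists of two or more capitalised words directly after one of those cue phrases, and the names SAMPLE, NAME_AFTER_CUE and names_after_cues are hypothetical.

```python
import re

# The sample lines from the question.
SAMPLE = """\
Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST
The patient was referred by Dr. Jacob Austin.
Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST
Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST
The patient was referred by
Dr. Jayden Green Olivia.
"""

NAME_AFTER_CUE = re.compile(
    r"(?:Electronically signed(?:\s+by)?\s*:?|referred by)"  # cue phrase
    r"\s+(?:Dr\.\s+)?"                                       # optional title, may wrap to next line
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)+)"                        # two or more capitalised words
)

def names_after_cues(text):
    """Return the names that directly follow a known cue phrase."""
    return NAME_AFTER_CUE.findall(text)

print(names_after_cues(SAMPLE))
# ['Wes Scott', 'Jacob Austin', 'Robert Clowson', 'John Douglas', 'Jayden Green Olivia']
```

A pattern like this only covers the cue phrases it was written for, so it is best treated as a sanity check alongside the NER output rather than a replacement for it.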


pol*_*m23 1

Try this:

import spacy
en = spacy.load('en')

sents = en(open('input.txt').read())
people = [ee for ee in sents.ents if ee.label_ == 'PERSON']