spaCy - Most efficient way to sort entities by label

Question

Lui*_*cri 3 python entity named-entity-recognition spacy

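I'm using a spaCy pipeline to extract all the entities from an article. I need to store these entities in variables according to the label they were tagged with. For now I have the solution below, but I don't think it is the most suitable one, because it iterates over all the entities once per label: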

import spacy

nlp = spacy.load("es_core_news_md")
text = "..."  # I load my text here
doc = nlp(text)

personEntities = list(set([e.text for e in doc.ents if e.label_ == "PER"]))
locationEntities = list(set([e.text for e in doc.ents if e.label_ == "LOC"]))
organizationEntities = list(set([e.text for e in doc.ents if e.label_ == "ORG"]))

Is there a direct way in spaCy to get all the entities for each label, or do I need to write for ent in ents: if... elif... elif... to achieve this?

Answer 1

Wik*_*żew 5

I suggest using groupby from itertools:

from itertools import groupby
#...
entities = {key: list(g) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}
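The sorted call in the snippet above matters because groupby only merges adjacent items that share a key. The following is a minimal sketch of that behavior; Ent is a hypothetical stand-in for spaCy's entity spans (it models only .text and .label_) so the example runs without a model, and the sample entities are made up for illustration.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for spaCy entity spans; only .text and .label_ are modeled.
Ent = namedtuple("Ent", ["text", "label_"])

# Made-up sample data: the two LOC entities are not adjacent.
ents = [
    Ent("Madrid", "LOC"),
    Ent("Ana", "PER"),
    Ent("Sevilla", "LOC"),
]


def label_of(ent):
    return ent.label_


# Without sorting, groupby emits two separate LOC groups,
# and the dict keeps only the last one for that key.
unsorted_groups = {key: [e.text for e in g] for key, g in groupby(ents, label_of)}

# Sorting by label first makes equal labels adjacent, so each label forms one group.
sorted_groups = {
    key: [e.text for e in g]
    for key, g in groupby(sorted(ents, key=label_of), label_of)
}

print(unsorted_groups)  # {'LOC': ['Sevilla'], 'PER': ['Ana']}
print(sorted_groups)    # {'LOC': ['Madrid', 'Sevilla'], 'PER': ['Ana']}
```

With a real Doc, doc.ents would take the place of the ents list.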

Or, if you only need to extract the unique values:

entities = {key: list(set(map(lambda x: str(x), g))) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}

You can then print the entities for a known label using

print(entities['ORG'])

If you need the unique occurrences of the entity objects, not just the strings, you can use

import spacy
from itertools import groupby

nlp = spacy.load("en_core_web_sm")
s = "Hello, Mr. Wood! We are in New York. Mrs. Winston is not coming, John hasn't sent her any invite. They will meet in California next time. General Motors and Toyota are companies."
doc = nlp(s * 2)

entities = dict()
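for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_):
    seen = set()   # entity texts already kept for this label
    unique = []    # first entity object seen for each distinct text
    for ent in g:
        if ent.text not in seen:
            seen.add(ent.text)
            unique.append(ent)
    entities[key] = unique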

The output of print(entities['GPE'][0].text) is New York here.
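As a possible alternative that is not part of the answer above, the same grouping can be done in a single pass without sorting, using collections.defaultdict. This is only a sketch under stated assumptions: Ent is a hypothetical stand-in for spaCy's entity spans, group_unique_by_label is an illustrative helper name, and the sample list is hand-written rather than produced by a model.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for spaCy entity spans; only .text and .label_ are modeled.
Ent = namedtuple("Ent", ["text", "label_"])


def group_unique_by_label(ents):
    """Group entities by label in one pass, keeping the first occurrence of each text."""
    grouped = defaultdict(list)  # label -> entity objects, in order of first appearance
    seen = defaultdict(set)      # label -> entity texts already kept
    for ent in ents:
        if ent.text not in seen[ent.label_]:
            seen[ent.label_].add(ent.text)
            grouped[ent.label_].append(ent)
    return dict(grouped)


# Hand-written sample; with a real pipeline you would pass doc.ents instead.
ents = [
    Ent("New York", "GPE"),
    Ent("General Motors", "ORG"),
    Ent("California", "GPE"),
    Ent("New York", "GPE"),
]

entities = group_unique_by_label(ents)
print([e.text for e in entities["GPE"]])  # ['New York', 'California']
print([e.text for e in entities["ORG"]])  # ['General Motors']
```

This avoids the sort and visits each entity once, at the cost of keeping one extra set per label.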
