Chi*_*cha 3 nlp machine-learning spacy
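We have a model ready that identifies custom named entities. The problem is that the model does not work as expected when it is given a whole document, but it gives excellent results when it is given only a few sentences.
I want to select the two sentences before and the two sentences after a tagged entity.
For example, if part of the document contains the word Colombo (tagged as GPE), I need to select the two sentences before the tag and the two sentences after it. I tried a few approaches, but the complexity is too high.
Does spacy have a built-in way to solve this?
I am using python and spacy.
I tried parsing the document by identifying the index of the tag, but that approach is really slow.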
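It might be worth seeing whether you can improve the custom named entity recognizer, because it should be unusual for extra context to hurt performance, and if you fix that issue the model will probably work better overall.
However, regarding your specific question about surrounding sentences:
A Token or a Span (an entity is a Span) has a .sent attribute that gives the covering sentence as a Span. If you look at the token just before the sentence's first token, or just after its last token, you can get the previous or next sentence for any token in the document.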
import spacy


def get_previous_sentence(doc, token_index):
    # The token just before the current sentence's first token
    # belongs to the previous sentence.
    if doc[token_index].sent.start - 1 < 0:
        return None
    return doc[doc[token_index].sent.start - 1].sent


def get_next_sentence(doc, token_index):
    # sent.end is exclusive, so doc[sent.end] is already the first
    # token of the next sentence (if there is one).
    if doc[token_index].sent.end >= len(doc):
        return None
    return doc[doc[token_index].sent.end].sent


nlp = spacy.load('en_core_web_lg')
text = "Jane is a name. Here is a sentence. Here is another sentence. Jane was the mayor of Colombo in 2010. Here is another filler sentence. And here is yet another padding sentence without entities. Someone else is the mayor of Colombo right now."
doc = nlp(text)

for ent in doc.ents:
    print(ent, ent.label_, ent.sent)
    print("Prev:", get_previous_sentence(doc, ent.start))
    print("Next:", get_next_sentence(doc, ent.start))
    print("----")
Output:
Jane PERSON Jane is a name.
Prev: None
Next: Here is a sentence.
----
Jane PERSON Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
2010 DATE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Someone else is the mayor of Colombo right now.
Prev: And here is yet another padding sentence without entities.
Next: None
----
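The question asks for two sentences on each side of an entity, while the helpers above return one. A minimal sketch of one way to widen the window is shown below: it collects the sentence boundaries once and then looks up each entity's sentence with a binary search, which avoids re-scanning the document per entity. The function name `entity_contexts` and its parameters `n_before` and `n_after` are hypothetical, not part of spaCy. To stay self-contained it assumes spaCy v3, uses a blank English pipeline with the rule-based `sentencizer`, and marks one entity by hand; with a trained pipeline, `doc.ents` and `doc.sents` would come from the model instead.

```python
import bisect

import spacy


def entity_contexts(doc, n_before=2, n_after=2):
    """Pair each entity with a Span covering its sentence plus neighbours.

    Hypothetical helper: returns a list of (entity, context_span) tuples.
    """
    sents = list(doc.sents)              # sentence boundaries, computed once
    starts = [s.start for s in sents]    # first token index of each sentence
    results = []
    for ent in doc.ents:
        # Index of the sentence that contains the entity's first token.
        i = bisect.bisect_right(starts, ent.start) - 1
        lo = max(i - n_before, 0)
        hi = min(i + n_after, len(sents) - 1)
        # Span.end is exclusive, so this slice covers sentences lo..hi.
        results.append((ent, doc[sents[lo].start:sents[hi].end]))
    return results


# Assumption: a blank pipeline with the rule-based sentencizer stands in
# for the real model so that the sketch runs without a model download.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = (
    "Jane is a name. Here is a sentence. Here is another sentence. "
    "Jane was the mayor of Colombo in 2010. Here is another filler sentence. "
    "And here is yet another padding sentence without entities. "
    "Someone else is the mayor of Colombo right now."
)
doc = nlp(text)

# Assumption: tag the first "Colombo" by hand, since a blank pipeline has no NER.
offset = text.index("Colombo")
doc.ents = [doc.char_span(offset, offset + len("Colombo"), label="GPE")]

contexts = entity_contexts(doc)
for ent, context in contexts:
    print(ent.text, ent.label_)
    print(context.text)
    print("----")
```

Near the start or end of a document the window is simply clipped, so an entity in the first sentence gets no preceding context rather than an error. The context is returned as a single Span, so it can be passed on as text via `context.text` or turned into its own Doc with `context.as_doc()`.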