spaCy Matcher: return only the longest match

use*_*036 4 python matcher spacy

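I'm trying to build noun chunks with the spaCy pattern Matcher. For example, given the sentence "The ice hockey scrimmage took hours.", I'd like to get back "ice hockey scrimmage" and "hours". This is what I have so far: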

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
# spaCy v2 signature: add(key, on_match, *patterns). In v3 this is matcher.add("NounChunks", [pattern]).
matcher.add("NounChunks", None, [{"POS": "NOUN"}, {"POS": "NOUN", "OP": "*"}, {"POS": "NOUN", "OP": "*"}])

doc = nlp("The ice hockey scrimmage took hours.")
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # look up the pattern name
    span = doc[start:end]  # the matched tokens
    print(match_id, string_id, start, end, span.text)

But it returns every sub-span of "ice hockey scrimmage", not just the longest one:

12482938965902279598 NounChunks 1 2 ice
12482938965902279598 NounChunks 1 3 ice hockey
12482938965902279598 NounChunks 2 3 hockey
12482938965902279598 NounChunks 1 4 ice hockey scrimmage
12482938965902279598 NounChunks 2 4 hockey scrimmage
12482938965902279598 NounChunks 3 4 scrimmage
12482938965902279598 NounChunks 5 6 hours

Am I missing something in how I define the pattern? I'd like it to return only:

12482938965902279598 NounChunks 1 4 ice hockey scrimmage
12482938965902279598 NounChunks 5 6 hours

Raq*_*qib 7

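I'm not aware of any built-in way to keep only the longest spans, but there is a utility function, spacy.util.filter_spans(spans), that helps here. It picks the longest spans among those given, and if several overlapping spans have the same length, it gives priority to the one that occurs first in the list of spans.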

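import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
# One or more consecutive nouns (spaCy v2 signature; in v3 use matcher.add("NounChunks", [pattern]))
matcher.add("NounChunks", None, [{"POS": "NOUN", "OP": "+"}])

doc = nlp("The ice hockey scrimmage took hours.")
matches = matcher(doc)

# Turn the (match_id, start, end) tuples into Span objects, then drop the overlapping shorter ones
spans = [doc[start:end] for _, start, end in matches]
print(spacy.util.filter_spans(spans))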

Output

[ice hockey scrimmage, hours]
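If you're curious what this kind of filtering amounts to, here is a rough pure-Python sketch that works on plain (start, end) token offsets instead of Span objects. The function name filter_longest is made up for illustration and is not part of spaCy. The tie-breaking rule (equal-length spans go to the one that starts earlier) is an assumption of this sketch.

```python
def filter_longest(spans):
    """Keep the longest non-overlapping (start, end) spans.

    Hypothetical helper for illustration only, not a spaCy function.
    Assumption: among equal-length spans, the earlier start wins.
    """
    # Longest spans first; for equal lengths, the smaller start offset first.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen = set()  # token indices already covered by a kept span
    for start, end in ordered:
        # Spans are visited longest-first, so a span overlapping a kept one
        # must have at least one of its endpoints inside it.
        if start not in seen and end - 1 not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


# Offsets taken from the matcher output in the question
matches = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4), (5, 6)]
print(filter_longest(matches))  # [(1, 4), (5, 6)]
```

Offsets (1, 4) and (5, 6) correspond to "ice hockey scrimmage" and "hours" in the example sentence.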