Nem*_*emo 3 python regex spacy
I have the following text:
text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'
When I use a plain regular expression, I get the following:
import re
regex = r'\d{1}[a|p]m'
re.findall(regex, text)
# Returned:
['5am', '6am', '9pm', '6am', '6am', '9pm']
However, when I use the same regex in spaCy, I get nothing:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_lg')
matcher = Matcher(nlp.vocab)
pattern = [{'TEXT': {'REGEX': r'\d{1}[a|p]m'}}]
matcher.add('TIME', None, pattern)
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.sent.text)
Does this mean we cannot use ordinary regular expressions in spaCy? If so, do you know where I can learn spaCy's special regex syntax? Thank you.
Wik*_*żew 10
Keep in mind that the tokenizer splits the number from the letters here. See this test:
doc = nlp("1pm")
print([token.text for token in doc]) # => ['1', 'pm']
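According to the spaCy documentation:
If spaCy's tokenization doesn't match the tokens defined in a pattern, the pattern is not going to produce any results.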
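To make that concrete, here is a minimal sketch that emulates token-level matching with the standard `re` module alone. It assumes the hard-coded `tokens` list is what the tokenizer produces for this fragment (digit split from `am`, as in the `1pm` test above); the helper name `tokens_matching` is hypothetical and not part of spaCy.

```python
import re

# Assumed tokenization of "to 5am 30%": the digit is split from "am",
# as in the '1pm' test above. Hard-coded here, not produced by spaCy.
tokens = ["to", "5", "am", "30", "%"]

def tokens_matching(regex, tokens):
    """Return the tokens whose own text matches the regex (one token at a time)."""
    return [tok for tok in tokens if re.search(regex, tok)]

# The question's regex needs a digit and 'am'/'pm' inside the SAME token.
print(tokens_matching(r"\d{1}[a|p]m", tokens))   # []
# On the raw string the same regex succeeds, because no token boundary interferes.
print(re.findall(r"\d{1}[a|p]m", "to 5am 30%"))  # ['5am']
```

The regex itself is fine; it simply never sees `5` and `am` together when it is applied to one token at a time.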
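So build the pattern token by token with rule-based matching: one token that looks like a number, followed by a token that is `am` or `pm`: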
pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
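A side note on the regex itself: inside a character class, `|` is a literal character, so the question's `[a|p]` also accepts a pipe, whereas `[ap]` accepts only `a` or `p`. The `^` and `$` anchors matter if, as I understand it, the `REGEX` operator searches within the token text rather than requiring a full match. A small check with Python's `re` module (the sample strings are made up for illustration):

```python
import re

loose = re.compile(r"[a|p]m")    # character class from the question, no anchors
strict = re.compile(r"^[ap]m$")  # character class and anchors from the pattern above

for token in ["am", "pm", "|m", "amount"]:
    print(token, bool(loose.search(token)), bool(strict.search(token)))
# am True True
# pm True True
# |m True False
# amount True False
```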
Then add it to the matcher:
matcher.add('TIME', None, pattern)  # spaCy v2 signature; in spaCy v3 use matcher.add('TIME', [pattern])
And get the matches:
matches = matcher(doc)  # run the matcher over the parsed Doc
for match_id, start, end in matches:
    span = doc[start:end]  # The matched span
    print(span.text)
Full demo:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'
doc = nlp(text)
matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
matcher.add('TIME', None, pattern)  # spaCy v2 signature; in spaCy v3 use matcher.add('TIME', [pattern])
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])
#=> [5am, 6am, 9pm, 6am, 6am, 9pm]
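For intuition about what the two-token pattern does, the following is a rough pure-Python emulation, not spaCy's actual implementation. The token list is hard-coded under the assumption that digits are already split from `am`/`pm`, `is_number` is a crude stand-in for `LIKE_NUM` (which also accepts things like number words), and all helper names are hypothetical.

```python
import re

# Assumed token texts for part of the demo sentence; hard-coded rather than
# produced by spaCy, with digits already split from 'am'/'pm'.
tokens = ["12", "midnight", "to", "5", "am", "30", "%", ".", "9", "pm", "Saturday"]

def is_number(tok):
    # Crude stand-in for LIKE_NUM: digits only.
    return tok.isdigit()

def is_am_pm(tok):
    # Mirrors {'LOWER': {'REGEX': '^[ap]m$'}}: lowercase first, then test the regex.
    return re.search(r"^[ap]m$", tok.lower()) is not None

def find_times(tokens):
    """Slide a two-token window and keep (start, end) pairs where both checks pass."""
    return [(i, i + 2) for i in range(len(tokens) - 1)
            if is_number(tokens[i]) and is_am_pm(tokens[i + 1])]

for start, end in find_times(tokens):
    print("".join(tokens[start:end]))
# 5am
# 9pm
```

Note that `12 midnight` and `30 %` are skipped because the second token in the window is not `am` or `pm`.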