The following code is based on the one provided in the answer:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
import re

nlp = spacy.load('en')

infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = spacy.util.compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"

for s in s1, s2:
    doc = nlp("{}".format(s))
    print([token.text for token in doc])
Result:
$ python3 /tmp/nlp.py
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']
['Out-of-box', 'implementation']
What do the first (r"[./]") and last (r"(.'.)") patterns used below mean? …
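
One way to see what those two patterns match is to try them with Python's standard re module, outside spaCy. This is only a minimal sketch: the helper name matches and the sample strings are made up for illustration, and it shows plain regex matching, not how the tokenizer applies its infix rules. Inside a character class the dot is literal, so [./] matches a single "." or "/"; in (.'.) the dots are wildcards, so it matches an apostrophe together with one character on each side.

```python
import re

# The two patterns from the question, compiled on their own with the
# standard library (spaCy is not involved here).
dot_or_slash = re.compile(r"[./]")         # a literal "." or "/"
around_apostrophe = re.compile(r"(.'.)")   # any char, an apostrophe, any char


def matches(pattern, text):
    """Hypothetical helper: return every non-overlapping match as a string."""
    return [m.group(0) for m in pattern.finditer(text)]


print(matches(dot_or_slash, "and/or e.g"))   # ['/', '.']
print(matches(around_apostrophe, "won't"))   # ["n't"]
print(matches(around_apostrophe, "'tis"))    # [] (no character before the apostrophe)
```

The "n't" match is consistent with the 'wo', "n't" split in the output above, and the "." match with 'accident', '.'.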