Posts by Zee*_*Ali

SpaCy: intra-word hyphens. How to treat them as one word?

The following code was provided as an answer to the question:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

# 'en' is a spaCy v2 shortcut; spaCy v3 needs a full package name such as 'en_core_web_sm'
nlp = spacy.load('en')

# Default prefix patterns plus three custom ones, all used as infix patterns
infixes = tuple(nlp.Defaults.prefixes) + (r"[./]", r"[-]~", r"(.'.)")

infix_re = compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    # Only infix_finditer is set, so no prefix, suffix, or exception rules apply
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"

for s in (s1, s2):
    doc = nlp(s)
    print([token.text for token in doc])

Result

$python3 /tmp/nlp.py  
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']  
['Out-of-box', 'implementation']  
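
To see why the result looks this way, here is a rough pure-Python sketch of infix splitting. It is a simplified model, not spaCy's actual implementation: the helper names `split_on_infixes` and `tokenize` are hypothetical, the default prefix patterns are left out, and only the three custom patterns are used.

```python
import re

# The three custom patterns from the code above, joined into one alternation.
# Assumption: the default prefix patterns are omitted to keep the sketch small.
INFIX_RE = re.compile("|".join([r"[./]", r"[-]~", r"(.'.)"]))


def split_on_infixes(chunk):
    """Emit the text before each infix match, then the match itself,
    then whatever remains after the last match."""
    pieces = []
    start = 0
    for match in INFIX_RE.finditer(chunk):
        if match.start() > start:
            pieces.append(chunk[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(chunk):
        pieces.append(chunk[start:])
    return pieces


def tokenize(text):
    # Split on whitespace first, then look for infixes inside each chunk.
    return [piece for chunk in text.split() for piece in split_on_infixes(chunk)]


print(tokenize("Marketing-Representative- won't die in car accident."))
print(tokenize("Out-of-box implementation"))
```

Under this model the hyphenated words survive only because no pattern matches a bare hyphen, while "won't" and "accident." are split wherever a pattern does match.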

What are the first (r"[./]") and last (r"(.'.)") patterns used for? …
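
One way to explore the patterns is to try each one on its own with Python's `re` module. The sketch below is only a probe of the regular expressions themselves; the variable names `first`, `middle`, and `last` are made up for illustration and the sample strings are not from spaCy.

```python
import re

first = re.compile(r"[./]")   # character class: a literal "." or "/"
middle = re.compile(r"[-]~")  # a hyphen immediately followed by "~"
last = re.compile(r"(.'.)")   # any character, an apostrophe, any character

print(first.findall("accident."))    # the trailing period matches
print(first.findall("and/or"))       # the slash matches
print(middle.findall("Out-of-box"))  # no "-~" sequence, so nothing matches
print(middle.findall("a-~b"))        # matches only when "~" follows the hyphen
print(last.findall("won't"))         # the apostrophe plus one character on each side
```

Read this way, the first pattern appears to split off periods and slashes, the last one appears to split contractions around the apostrophe, and the middle one leaves ordinary hyphens alone, which is consistent with the result shown above.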

nlp tokenize spacy

2
votes
1
answer
1699
views