spaCy tokenizer with only a "whitespace" rule

Ser*_*gio 4 python nlp python-3.x spacy

I would like to know whether the spaCy tokenizer can tokenize words using only the "whitespace" rule. For example:

sentence= "(c/o Oxford University )"

Normally, with the following spaCy configuration:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for token in doc:
    print(token)

the result is:

 (
 c
 /
 o
 Oxford
 University
 )

Instead, I would like output like the following (using spaCy):

(c/o 
Oxford 
University
)

Is it possible to get such a result with spaCy?

Ser*_*nov 11

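Let's replace nlp.tokenizer with a custom Tokenizer that uses a token_match regex. A Tokenizer built this way has no prefix, suffix, infix, or exception rules, so whitespace is the only place it splits: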

import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')
text = "This is it's"
print("Before:", [tok for tok in nlp(text)])

nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r'\S+').match)
print("After :", [tok for tok in nlp(text)])

Output:

Before: [This, is, it, 's]
After : [This, is, it's]
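
Applied to the sentence from the question, the same idea can be sketched as follows. This is only a minimal sketch: it assumes a blank English pipeline (spacy.blank("en")) instead of a downloaded model, which should be enough because only the tokenizer is exercised.

```python
import re

import spacy
from spacy.tokenizer import Tokenizer

# Assumption: a blank English pipeline is used so no trained model is required.
nlp = spacy.blank("en")

# No prefix/suffix/infix rules and no exceptions are passed,
# so whitespace is the only token boundary.
nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r"\S+").match)

sentence = "(c/o Oxford University )"
tokens = [tok.text for tok in nlp(sentence)]
print(tokens)  # ['(c/o', 'Oxford', 'University', ')']
```

The four tokens match the output asked for in the question.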

You can tune the Tokenizer further by adding custom suffix, prefix, and infix rules.
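
As a rough illustration of such tuning, the sketch below passes hand-written prefix and suffix patterns to Tokenizer through its prefix_search and suffix_search arguments (infix rules would go through infix_finditer in the same way). The two regexes and the sample text are made up for this example and are not spaCy's defaults; a blank English pipeline is assumed.

```python
import re

import spacy
from spacy.tokenizer import Tokenizer

# Assumption: a blank English pipeline, since only the tokenizer matters here.
nlp = spacy.blank("en")

# Hypothetical rules for illustration only: split opening brackets off the
# front of a chunk and closing brackets / basic punctuation off its end.
prefix_re = re.compile(r"^[(\[{]")
suffix_re = re.compile(r"[)\]}.,!?]$")

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
)

tokens = [tok.text for tok in nlp("Hello, (world)!")]
print(tokens)  # ['Hello', ',', '(', 'world', ')', '!']
```

Each whitespace-separated chunk has prefixes and suffixes peeled off one match at a time, so a pattern only needs to describe a single leading or trailing piece.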

An alternative, more fine-grained approach is to find out why the it's token is split the way it is, using nlp.tokenizer.explain():

import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's. I'm fine"
nlp.tokenizer.explain(text)

You will find that the split is caused by SPECIAL rules:

[('TOKEN', 'This'),
 ('TOKEN', 'is'),
 ('SPECIAL-1', 'it'),
 ('SPECIAL-2', "'s"),
 ('SUFFIX', '.'),
 ('SPECIAL-1', 'I'),
 ('SPECIAL-2', "'m"),
 ('TOKEN', 'fine')]

The exceptions can be updated to remove "it's", like so:

exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k:v for k,v in exceptions.items() if k!="it's"}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]

Output:

[This, is, it's., I, 'm, fine]

Or to remove splitting on apostrophes altogether:

filtered_exceptions = {k:v for k,v in exceptions.items() if "'" not in k}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]

Output:

[This, is, it's., I'm, fine]

Note the dot that stays attached to the token it's. This happens because no suffix rules were specified.
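
One way to detach the dot again is to pass a suffix rule together with the filtered exceptions. The sketch below is only an illustration: the punctuation-only suffix regex is an assumption made for this example (spaCy's default English suffixes also include 's, which would split it's again), and a blank English pipeline stands in for en_core_web_sm.

```python
import re

import spacy
from spacy.tokenizer import Tokenizer

# Assumption: a blank English pipeline; it carries the same English
# tokenizer exceptions as the language defaults.
nlp = spacy.blank("en")
text = "This is it's. I'm fine"

# Drop every exception whose key contains an apostrophe.
exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k: v for k, v in exceptions.items() if "'" not in k}

# Hypothetical suffix rule: split trailing sentence punctuation only.
suffix_re = re.compile(r"[.,!?]$")

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=filtered_exceptions,
    suffix_search=suffix_re.search,
)

tokens = [tok.text for tok in nlp(text)]
print(tokens)  # ['This', 'is', "it's", '.', "I'm", 'fine']
```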