How do I tell spaCy not to split any words with apostrophes, using the retokenizer?

Question

I have run into a problem. I want to use spaCy's word tokenizer, but with one constraint: the tokenizer must not split words that contain an apostrophe (').

Example:

Input string: "I can't do this"
Current output: ["I","ca","n't","do","this"]
Expected output: ["I","can't","do","this"]

My attempt:

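import spacy

nlp = spacy.load("en_core_web_sm")
sent = "I can't do this"
doc = nlp(sent)
# Indices of tokens that contain an apostrophe (the first token is skipped)
position = [token.i for token in doc if token.i != 0 and "'" in token.text]
with doc.retokenize() as retokenizer:
    for pos in position:
        # Merge each such token with the token right before it
        retokenizer.merge(doc[pos - 1:pos + 1])
for token in doc:
    print(token.text)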

This gives me the expected output, but I don't know whether the approach is correct. Is there a better way to retokenize?

Answer 1

aab*_*aab 5

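The retokenizer approach works, but a simpler way is to modify the tokenizer so that it does not split these words in the first place. Contractions with apostrophes that get split like this (don't, can't, I'm, you'll, etc.) are handled by tokenizer exceptions.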
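
To see what such an exception looks like, here is a minimal sketch. It assumes a blank English pipeline created with `spacy.blank("en")` (no trained model is loaded), and the variable names `exceptions`, `pieces`, and `tokens` are only illustrative.

```python
import spacy
from spacy.symbols import ORTH

# Assumption: a blank English pipeline is enough here, because tokenizer
# exceptions come from the language defaults rather than a trained model.
nlp = spacy.blank("en")

# Each exception maps one exact string to the list of tokens it is split into.
exceptions = nlp.Defaults.tokenizer_exceptions
pieces = [piece[ORTH] for piece in exceptions["can't"]]
print(pieces)

# The default tokenizer applies that rule whenever it meets the string.
tokens = [t.text for t in nlp("I can't do this")]
print(tokens)
```

The split into "ca" and "n't" therefore comes from a lookup on the whole string "can't", not from a pattern that matches apostrophes, which is why removing the exception is enough to keep the word intact.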

With spaCy v2.2.3, you can inspect and set the tokenizer exceptions through the nlp.tokenizer.rules property. To remove every exception that contains an apostrophe:

import spacy

nlp = spacy.load('en_core_web_sm')
# Keep only the tokenizer exceptions whose key contains no apostrophe
nlp.tokenizer.rules = {key: value for key, value in nlp.tokenizer.rules.items() if "'" not in key and "’" not in key and "‘" not in key}
assert [t.text for t in nlp("can't")] == ["can't"]

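Note that the default models spaCy provides for English (tagger, parser, NER) will not perform well on text tokenized this way, because they were trained on data in which contractions are split.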

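With older versions of spaCy, you have to create a custom tokenizer and pass in rules= after modifying nlp.Defaults.tokenizer_exceptions. Reuse all the other existing settings (nlp.tokenizer.prefix_search / suffix_search / infix_finditer / token_match) so that tokenization stays the same in every other case.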
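
The custom-tokenizer route could look roughly like the sketch below. It is an illustration under assumptions rather than the exact code the answer had in mind: it uses a blank English pipeline so that no model download is needed, and the names `APOSTROPHES`, `rules`, and `old` are made up for this example.

```python
import spacy
from spacy.tokenizer import Tokenizer

# Assumption: a blank English pipeline; a loaded model such as
# en_core_web_sm exposes the same tokenizer attributes.
nlp = spacy.blank("en")

# Hypothetical helper constant: the apostrophe variants to filter on.
APOSTROPHES = ("'", "’", "‘")

# Start from the default exceptions and drop every key with an apostrophe.
rules = {
    key: value
    for key, value in nlp.Defaults.tokenizer_exceptions.items()
    if not any(mark in key for mark in APOSTROPHES)
}

# Rebuild the tokenizer, reusing the existing prefix, suffix, infix and
# token_match settings so that everything else is tokenized as before.
old = nlp.tokenizer
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=rules,
    prefix_search=old.prefix_search,
    suffix_search=old.suffix_search,
    infix_finditer=old.infix_finditer,
    token_match=old.token_match,
)

tokens = [t.text for t in nlp("I can't do this")]
print(tokens)
```

Filtering on the key removes every exception that contains an apostrophe, including entries that are not contractions, so it is worth checking a few sample sentences before relying on the result.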
