spaCy custom tokenizer: keeping hyphenated words as single tokens using an infix regex

Question


Vis*_*hal 6 regex nlp linguistics tokenize spacy

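I want to keep hyphenated words (e.g. long-term, self-esteem) as single tokens in spaCy. After looking at similar posts on Stack Overflow, GitHub, the spaCy documentation, and elsewhere, I wrote the custom tokenizer below.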

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]


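So for this sentence: 'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it's a male-dominated profession.'

Now, the tokens after plugging in the custom spaCy tokenizer are: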

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“medicine', '”', 'has', ';', 'become', 'a', 'profession', ',', 'and', 'more', 'importantly', ',', "it's", 'a', 'male-dominated', 'profession', '.'

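Before this change, the tokens were:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male', '-', 'dominated', 'profession', '.'

And the expected tokens are:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.'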

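As one can see, the hyphenated word is now kept together, and the other punctuation is handled too, except for the double quotes and the apostrophe: these no longer show the earlier (expected) behavior. I have tried different permutations and combinations of the infix regex but have made no progress on this. Any help would be highly appreciated.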

Answer 1

Nic*_*ley 15

Using the default prefix_re and suffix_re gives me the expected output:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_sm')  # the original answer used the 'en' shortcut, which spaCy v3 no longer supports
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]


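['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']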
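One quick way to see why the hand-written patterns behaved differently is to run them against the problem words directly. The sketch below uses only the standard `re` module and the three patterns copied from the question; the variable names for the results are made up for illustration.

```python
import re

# Patterns copied verbatim from the question.
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

word = '“medicine”'

# The prefix and suffix classes list only straight quotes, so the curly
# quotes are never recognised as a prefix or a suffix.
prefix_hit = prefix_re.search(word)   # None
suffix_hit = suffix_re.search(word)   # None

# The infix class does list the curly quotes, so they are only ever
# seen as infixes, at the first and last positions of the word.
infix_positions = [m.start() for m in infix_re.finditer(word)]  # [0, 9]

# The hyphen is absent from the infix class, which is what keeps
# 'male-dominated' together; the apostrophe is present, so "it's"
# contains an infix match.
hyphen_hit = infix_re.search('male-dominated')                        # None
apostrophe_positions = [m.start() for m in infix_re.finditer("it's")]  # [2]

print(prefix_hit, suffix_hit, infix_positions, hyphen_hit, apostrophe_positions)
```

This only shows what the regexes match. How spaCy then turns those matches into tokens is up to the tokenizer itself, but it is consistent with the curly quotes no longer being split off the way the default prefix and suffix rules split them.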

If you want to dig into why your regexes weren't working like spaCy's, here are links to the relevant source code:

Prefixes and suffixes are defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

They refer to the character classes (e.g. quotes, hyphens) defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

And the functions used to compile them (e.g. compile_prefix_regex):

https://github.com/explosion/spaCy/blob/master/spacy/util.py
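To get a feel for how prefix, suffix, and infix rules interact, here is a toy model in plain Python. It is a deliberately simplified sketch, not spaCy's implementation: it assumes whitespace splitting first, then repeated prefix/suffix stripping, then infix splitting, and that an infix match at the very start of a chunk is not split off. `toy_tokenize` and the two `*_curly_re` patterns are hypothetical names introduced only for this illustration.

```python
import re

# Patterns copied from the question.
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

# Hypothetical variants that also treat curly double quotes as prefix/suffix.
prefix_curly_re = re.compile(r'''^[\[\("'“]''')
suffix_curly_re = re.compile(r'''[\]\)"'”]$''')


def toy_tokenize(text, prefix_search, suffix_search, infix_finditer):
    """Toy whitespace -> prefix/suffix -> infix splitter (not spaCy's real code)."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel prefixes and suffixes off the chunk until neither matches.
        while chunk:
            m = prefix_search(chunk)
            if m and m.end() > 0:
                prefixes.append(m.group())
                chunk = chunk[m.end():]
                continue
            m = suffix_search(chunk)
            if m and m.start() < len(chunk):
                suffixes.insert(0, m.group())
                chunk = chunk[:m.start()]
                continue
            break
        # Split what is left on infix matches.
        middle, offset = [], 0
        for m in infix_finditer(chunk):
            if m.start() == 0:
                continue  # assumption: a match at position 0 stays attached
            if m.start() > offset:
                middle.append(chunk[offset:m.start()])
            middle.append(m.group())
            offset = m.end()
        if chunk[offset:]:
            middle.append(chunk[offset:])
        tokens.extend(prefixes + middle + suffixes)
    return tokens


text = 'the practice of “medicine” is male-dominated.'
before = toy_tokenize(text, prefix_re.search, suffix_re.search, infix_re.finditer)
after = toy_tokenize(text, prefix_curly_re.search, suffix_curly_re.search, infix_re.finditer)
print(before)  # ['the', 'practice', 'of', '“medicine', '”', 'is', 'male-dominated', '.']
print(after)   # ['the', 'practice', 'of', '“', 'medicine', '”', 'is', 'male-dominated', '.']
```

In this toy model the opening curly quote stays glued to the word until the prefix pattern knows about it. spaCy's default prefix and suffix lists cover many more characters than the two added here, which is why reusing the defaults is the simpler fix.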

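I can't thank you enough, Nicholas! :) It works as expected now. As you rightly pointed out, the issue was with the default prefix_re and suffix_re. Thanks also for sharing the links to the punctuation and character definitions (e.g. quotes, hyphens) and to the functions used to compile them! They are very handy and will help in covering all the special cases, especially for other languages! (2 upvotes)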
