Posts by Vis*_*hal

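Spacy custom tokenizer to include only hyphenated words as tokens using Infix regex

I want to keep hyphenated words (e.g. long-term, self-esteem, etc.) as single tokens in Spacy. After looking at similar posts on Stackoverflow, Github, the Spacy documentation and elsewhere, I wrote a custom tokenizer as shown below.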

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]


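So for this sentence: 'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it's a male-dominated profession.'

Now, the tokens after incorporating the custom Spacy tokenizer are:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“medicine', '”', 'has', ';', 'become', 'a', 'profession', ',', 'and', 'more', 'importantly', ',', "it's", 'a', …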
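
To see which characters the infix pattern would split on, the regex can be run directly with Python's `re` module, with no Spacy model loaded. The sketch below is an illustration only: the helper name `infix_hits` is hypothetical and not part of Spacy, and the check looks at regex matches alone. It does not reproduce the prefix and suffix stripping that the tokenizer applies before it looks for infixes.

```python
import re

# Infix pattern copied unchanged from the custom tokenizer in the post.
# Note that the hyphen is not in the character class.
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')


def infix_hits(word):
    """Return the characters inside `word` matched by the infix pattern.

    `infix_hits` is a hypothetical helper for inspection only.
    """
    return [m.group() for m in infix_re.finditer(word)]


# Words taken from the example sentence in the post.
for word in ["male-dominated", "it's", "profession;"]:
    print(repr(word), infix_hits(word))
```

An empty list for `male-dominated` means the pattern gives the tokenizer no reason to split at the hyphen, while the apostrophe in `it's` and the semicolon in `profession;` are both matched. Characters left out of the class are never split as infixes, so whether that is the desired behaviour depends on which punctuation should still separate tokens.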

regex nlp linguistics tokenize spacy

Vis*_*hal

2018-06-25

6
votes

1
answer

3063
views