spaCy: replacing a token

sac*_*ruk 2 python spacy

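I am trying to replace a word without destroying the whitespace structure of the sentence. Suppose I have the sentence text = "Hi this is my dog." and I want to replace dog with Simba. Following the answer at /sf/answers/4004442151/, I did: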

import spacy
nlp = spacy.load("en_core_web_lg")
from spacy.tokens import Doc

doc1 = nlp("Hi this is my dog.")
new_words = [token.text if token.text!="dog" else "Simba" for token in doc1]
Doc(doc1.vocab, words=new_words)
# Hi this is my Simba . 

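Note the extra space at the end, before the full stop (it should be Hi this is my Simba.). Is there a way to get rid of this behaviour? I would also be happy with a general Python string-processing answer.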

小智 5

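The function below replaces any number of matches (found with spaCy), keeps the same whitespace as the original text, and handles edge cases appropriately (for example, a match at the very beginning of the text):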

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")

matcher = Matcher(nlp.vocab)
# spaCy v3 signature: a list of patterns. On spaCy v2 use matcher.add("dog", None, [{"LOWER": "dog"}])
matcher.add("dog", [[{"LOWER": "dog"}]])

def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:  # If we've skipped over some tokens, let's add those in (with trailing whitespace if available)
            text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
        text += replacement + tok[match_start].whitespace_  # Replace token, with trailing whitespace if available
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text

>>> replace_word("Hi this is my dog.", "Simba")
Hi this is my Simba.

>>> replace_word("Hi this dog is my dog.", "Simba")
Hi this Simba is my Simba.
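The question also asks for a plain Python string-handling answer. For comparison with the Matcher-based function above, here is a minimal sketch that skips spaCy entirely and relies on regex word boundaries. The name `replace_word_plain` and its parameters are hypothetical, and the sketch assumes that `\b` boundaries are an acceptable stand-in for spaCy's tokenization (they will differ on some inputs, such as hyphenated words). Case-insensitive matching is chosen to mirror the `LOWER` pattern above.

```python
import re


def replace_word_plain(text, target, replacement):
    """Replace whole-word occurrences of `target`, leaving all whitespace untouched."""
    # \b anchors the match at word boundaries, so "dog" matches
    # but "dogs" and "hotdog" do not.
    pattern = re.compile(rf"\b{re.escape(target)}\b", flags=re.IGNORECASE)
    # Using a function as the replacement keeps any backslashes
    # in `replacement` literal instead of treating them as escapes.
    return pattern.sub(lambda _match: replacement, text)


print(replace_word_plain("Hi this is my dog.", "dog", "Simba"))
print(replace_word_plain("Hi this dog is my dog.", "dog", "Simba"))
```

Because only the matched characters are rewritten, everything around them, including punctuation and runs of spaces, stays exactly as it was in the input.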


小智 5

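spaCy tokens have a couple of attributes that can help you. The first is token.text_with_ws, which gives you the token's text together with its original trailing whitespace, if it has any. The second is token.whitespace_, which returns only the trailing whitespace on the token (an empty string if there is none). If you don't need the large language model for anything else you are doing, you can just use spaCy's tokenizer.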

from spacy.lang.en import English
nlp = English() # you probably don't need to load whole lang model for this
tokenizer = nlp.tokenizer
tokens = tokenizer("Hi this is my dog.")

modified = ""
for token in tokens:
    if token.text != "dog":
        modified += token.text_with_ws
    else:
        modified += "Simba"
        modified += token.whitespace_
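The same token.whitespace_ attribute can also be used to repair the approach from the question. When a Doc is built from words alone, every token is assumed to be followed by a space, which is where the stray space before the full stop comes from. A minimal sketch, assuming a blank English pipeline from `spacy.blank("en")` is enough for tokenization (the variable names `spaces` and `doc2` are illustrative), passes the original spacing through the `spaces` argument of `Doc`:

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline (tokenizer only) is sufficient here;
# a loaded model such as en_core_web_lg would work the same way.
nlp = spacy.blank("en")

doc1 = nlp("Hi this is my dog.")
new_words = ["Simba" if token.text == "dog" else token.text for token in doc1]
# One boolean per token: True if the original token was followed by whitespace.
spaces = [bool(token.whitespace_) for token in doc1]

doc2 = Doc(doc1.vocab, words=new_words, spaces=spaces)
print(doc2.text)
```

The result is a new Doc rather than a plain string, which may be convenient if further spaCy processing is planned. Annotations from the original document are not carried over.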