spaCy: replace a token

Question

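I am trying to replace a word without breaking the spacing of the sentence. Suppose I have the sentence text = "Hi this is my dog." and I want to replace dog with Simba. Following the answer at /sf/answers/4004442151/, I did: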

import spacy
nlp = spacy.load("en_core_web_lg")
from spacy.tokens import Doc

doc1 = nlp("Hi this is my dog.")
new_words = [token.text if token.text!="dog" else "Simba" for token in doc1]
Doc(doc1.vocab, words=new_words)
# Hi this is my Simba .

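Note the extra space before the period at the end (it should be Hi this is my Simba.). Is there a way to get rid of this behavior? I would also be happy with a general Python string-processing answer.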

Answer 1

小智 5

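The function below replaces any number of matches (found with spaCy), keeps the same whitespace as the original text, and handles edge cases properly (for example, a match at the very start of the text):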

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")

matcher = Matcher(nlp.vocab)
# spaCy v3 signature; on spaCy v2 use: matcher.add("dog", None, [{"LOWER": "dog"}])
matcher.add("dog", [[{"LOWER": "dog"}]])

def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:  # If we've skipped over some tokens, let's add those in (with trailing whitespace if available)
            text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
        text += replacement + tok[match_start].whitespace_  # Replace token, with trailing whitespace if available
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text

>>> replace_word("Hi this is my dog.", "Simba")
Hi this is my Simba.

>>> replace_word("Hi this dog is my dog.", "Simba")
Hi this Simba is my Simba.

Answer 2

小智 5

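spaCy tokens have a couple of attributes that can help here. The first is token.text_with_ws, which gives you the token's text together with its original trailing whitespace, if any. The second is token.whitespace_, which returns only the trailing whitespace on the token (an empty string if there is none). If you do not need the large language model for anything else you are doing, you can use spaCy's tokenizer on its own.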

from spacy.lang.en import English
nlp = English() # you probably don't need to load whole lang model for this
tokenizer = nlp.tokenizer
tokens = tokenizer("Hi this is my dog.")

modified = ""
for token in tokens:
    if token.text != "dog":
        modified += token.text_with_ws
    else:
        modified += "Simba"
        modified += token.whitespace_
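
The same whitespace_ attribute can also be fed back into the Doc constructor used in the question, which accepts an optional spaces list alongside words. The following is only a minimal sketch, assuming spaCy is installed; spacy.blank("en") is used just to avoid downloading a model, and the helper name replace_token is hypothetical:

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline: tokenizer only, no model download (an assumption for this sketch).
nlp = spacy.blank("en")

def replace_token(text, target, replacement):
    """Rebuild a Doc with `target` tokens swapped out, keeping the original spacing."""
    doc = nlp(text)
    words = [replacement if t.text == target else t.text for t in doc]
    # True where the original token was followed by a space, False otherwise.
    spaces = [bool(t.whitespace_) for t in doc]
    return Doc(doc.vocab, words=words, spaces=spaces)

new_doc = replace_token("Hi this is my dog.", "dog", "Simba")
print(new_doc.text)
```

Unlike the string-building loop, this returns a Doc, which may be convenient if the result needs to go through further spaCy processing.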

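
Since the question also asks for a general Python string-processing approach, here is a rough sketch that uses only the standard library re module. It assumes that matching on regex word boundaries is good enough for the use case, which is not the same as spaCy's tokenization; the function name replace_whole_word is hypothetical:

```python
import re

def replace_whole_word(text, target, replacement):
    """Replace whole-word occurrences of `target`, leaving all other characters untouched."""
    # \b marks a word boundary, so "dog" matches but "dogs" and "hotdog" do not.
    pattern = r"\b" + re.escape(target) + r"\b"
    # A function replacement avoids backslashes in `replacement` being read as escapes.
    return re.sub(pattern, lambda match: replacement, text)

print(replace_whole_word("Hi this is my dog.", "dog", "Simba"))
```

Because the substitution works on the raw string, the spacing around punctuation is never touched in the first place.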
