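I'm trying to replace a word in a sentence without breaking the sentence's spacing. Suppose I have the sentence text = "Hi this is my dog." and I want to replace dog with Simba. Following the answer at /sf/answers/4004442151/, I did: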
import spacy
nlp = spacy.load("en_core_web_lg")
from spacy.tokens import Doc
doc1 = nlp("Hi this is my dog.")
new_words = [token.text if token.text!="dog" else "Simba" for token in doc1]
Doc(doc1.vocab, words=new_words)
# Hi this is my Simba .
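Note the extra space before the full stop at the end (it should be Hi this is my Simba.). Is there a way to get rid of this behavior? I'd also be happy with a general Python string-processing answer.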
小智 5
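The function below replaces any number of matches (found with spaCy), keeps the same whitespace as the original text, and handles edge cases appropriately (such as a match at the very beginning of the text):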
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 this is matcher.add("dog", [[{"LOWER": "dog"}]])
matcher.add("dog", None, [{"LOWER": "dog"}])

def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        # If we've skipped over some tokens, add those in (with trailing whitespace if available)
        if match_start > buffer_start:
            text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
        # Replace the matched token, keeping its trailing whitespace if available
        text += replacement + tok[match_start].whitespace_
        buffer_start = match_start + 1
    # Append whatever follows the last match
    text += tok[buffer_start:].text
    return text

>>> replace_word("Hi this is my dog.", "Simba")
Hi this is my Simba.
>>> replace_word("Hi this dog is my dog.", "Simba")
Hi this Simba is my Simba.
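The extra space in the question most likely comes from Doc(doc1.vocab, words=new_words) being built without a spaces argument, in which case spaCy assumes every token is followed by a space. A minimal sketch of the same idea with the original spacing passed along is shown here. The function name replace_token is hypothetical, and spacy.blank("en") is an assumption made to avoid loading en_core_web_lg, since only tokenization is needed.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline (tokenizer only) is sufficient here.
nlp = spacy.blank("en")

def replace_token(text, target, replacement):
    """Hypothetical helper: swap exact-match tokens and keep the original spacing."""
    doc = nlp(text)
    words = [replacement if token.text == target else token.text for token in doc]
    # spaces[i] is True when token i was followed by a space in the original text.
    spaces = [bool(token.whitespace_) for token in doc]
    return Doc(doc.vocab, words=words, spaces=spaces).text

print(replace_token("Hi this is my dog.", "dog", "Simba"))
# Hi this is my Simba.
```

This keeps the approach from the question and only adds the spaces list, so the result is still a Doc if further spaCy processing is needed.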
小智 5
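spaCy tokens have a couple of attributes that can help you here. The first is token.text_with_ws, which gives you the token's text together with its original trailing whitespace, if it had any. The second is token.whitespace_, which returns only the trailing whitespace on the token (an empty string if there was none). If you don't need the large language model for anything else you're doing, you can use spaCy's tokenizer on its own.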
from spacy.lang.en import English

nlp = English() # you probably don't need to load whole lang model for this
tokenizer = nlp.tokenizer
tokens = tokenizer("Hi this is my dog.")
modified = ""
for token in tokens:
    if token.text != "dog":
        modified += token.text_with_ws
    else:
        modified += "Simba"
        modified += token.whitespace_
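Since the question also asks for a general Python string-processing answer, here is a minimal sketch that uses only the standard library. It assumes the target is a single word that a regex word boundary can delimit; the function name replace_whole_word is hypothetical. Because only the matched word is substituted, all surrounding whitespace and punctuation stay exactly as they were.

```python
import re

def replace_whole_word(text, target, replacement):
    """Hypothetical helper: replace whole-word, case-insensitive matches of `target`."""
    pattern = re.compile(rf"\b{re.escape(target)}\b", flags=re.IGNORECASE)
    # A function is used as the replacement so that backslashes in
    # `replacement` are inserted literally instead of being read as escapes.
    return pattern.sub(lambda _match: replacement, text)

print(replace_whole_word("Hi this is my dog.", "dog", "Simba"))
# Hi this is my Simba.
print(replace_whole_word("My hotdog and my Dog.", "dog", "Simba"))
# My hotdog and my Simba.
```

Unlike the spaCy versions, this does not tokenize, so it will not follow spaCy's rules for contractions or other special cases; it is only meant for simple whole-word swaps.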