alv*_*vas 6 python regex string replace placeholder
Given a string and a list of substrings that should be replaced with placeholders, e.g.
import re
from copy import copy
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
The first goal is to replace the substrings from phrases in original_text with indexed placeholders, e.g.
text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[OUT]:
Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen
Then there will be some function that manipulates the text containing the placeholders, e.g.
cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)
[OUT]:
MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2
The last step is to do the replacement in reverse and put the original phrases back, i.e.
' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])
[OUT]:
"'s_morgen ik 's-Hertogenbosch depository_financial_institution"
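As a rough sketch of that last step, the placeholders could also be restored in a single regex pass instead of splitting on whitespace, so a placeholder that is attached to punctuation is still found. The names restore_placeholders and PLACEHOLDER are hypothetical and not from the original post; the mapping is the one the loop above builds.

```python
import re

# Mapping from placeholder to phrase, as built by the loop in the question.
backplacement = {
    "MWEPHRASE0": "'s_morgen",
    "MWEPHRASE1": "'s-Hertogenbosch",
    "MWEPHRASE2": "depository_financial_institution",
}

# Hypothetical pattern: the placeholder prefix followed by its index.
PLACEHOLDER = re.compile(r"MWEPHRASE\d+")


def restore_placeholders(text, mapping):
    # Look every placeholder up in the mapping; unknown ones are left as-is.
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


cleaned_text = "MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2"
restored = restore_placeholders(cleaned_text, backplacement)
print(restored)
# 's_morgen ik 's-Hertogenbosch depository_financial_institution
```

This assumes the placeholder prefix never occurs in the real text; if it can, a less common prefix would be needed.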
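The questions are:
If phrases is very large, the first and the last replacement will take a very long time. Is there a way to do the replacement and the back-replacement with a regex?
The re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) substitution isn't very helpful, especially when a substring in phrases matches something that is not a full word, e.g.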
phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
We get an awkward output:
Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen
I've tried using '\b{}\b'.format(phrase), but that doesn't work for phrases that start or end with punctuation, because \b only matches between a word character and a non-word character, i.e.
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[OUT]:
Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen
Is there some way to denote the word boundaries for the phrases in the re.sub regex pattern?
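One possible way to approach both questions, offered as a minimal sketch rather than a definitive fix: escape every phrase with re.escape, join them into a single alternation, and use the lookarounds (?<!\w) and (?!\w) in place of \b, so a phrase may begin or end with punctuation. The function name replace_with_placeholders is hypothetical, "org" is added to the list only to show that partial words are skipped, and whether one big alternation is actually faster for a very large phrases list is an assumption that would need measuring.

```python
import re

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution", "org"]
original_text = ("Something, 's morgen, ik 's-Hertogenbosch im das "
                 "depository financial institution gehen")


def replace_with_placeholders(text, phrases):
    # Assumes phrases is non-empty and contains no empty strings.
    placeholder_of = {p: "MWEPHRASE{}".format(i) for i, p in enumerate(phrases)}
    backplacement = {ph: p.replace(' ', '_') for p, ph in placeholder_of.items()}

    # Longest phrases first, so a longer phrase wins over a shorter one
    # that happens to be its prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)

    # (?<!\w) and (?!\w): no word character directly before or after the match.
    pattern = re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))

    # One pass over the text; the matched string is looked up in the dict.
    new_text = pattern.sub(lambda m: placeholder_of[m.group(0)], text)
    return new_text, backplacement


text, backplacement = replace_with_placeholders(original_text, phrases)
print(text)
# Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen
```

The "org" inside "morgen" is left alone because the lookbehind sees the preceding "m", while "'s morgen" still matches even though it starts with an apostrophe.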
Instead of using re.sub, you can split the string!
def do_something_with_str(string):
    # do something with string here.
    # for example let's wrap the string with "@" symbol if it's not empty
    return f"@{string}" if string else string


def get_replaced_list(string, words):
    result = [(string, True), ]

    # we take each word we want to replace
    for w in words:
        new_result = []

        # Getting each word in old result
        for r in result:
            # Now we split every string in results using our word.
            split_list = list((x, True) for x in r[0].split(w)) if r[1] else list([r, ])

            # If we replace successfully - add all the strings
            if len(split_list) > 1:
                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])

                # Add to new result
                new_result.extend(sub_result)
            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result


if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])
Here are the variable values for the example above:
initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True)]
modified_list = [('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'
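As you can see, the list contains tuples. Each tuple holds two values: a string, and a boolean that says whether it is plain text or a replaced value (True when it is text). Once you have the replaced list, you can modify it as in the example, checking whether an item is a text value (if x[1] == True). Hope that helps!
P.S. String formatting like f"some string here {some_variable_here}" requires Python 3.6 or later.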
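For comparison, the same text/replaced segmentation can be sketched with re.split and a capturing group, which keeps the matched words in the result list. This is an alternative sketch, not the code from the answer above: split_keep_matches is a hypothetical name, and it tries all words at each position instead of splitting word by word, so results may differ when the words overlap.

```python
import re


def split_keep_matches(string, words):
    # Assumes words is non-empty and contains no empty strings.
    # The capturing group makes re.split return the separators as well.
    pattern = "({})".format("|".join(re.escape(w) for w in words))
    parts = re.split(pattern, string)
    # Even indices hold the text between matches, odd indices the matched words.
    return [(part, i % 2 == 0) for i, part in enumerate(parts)]


segments = split_keep_matches('acbbca', ('a', 'c'))
print(segments)
# [('', True), ('a', False), ('', True), ('c', False), ('bb', True),
#  ('c', False), ('', True), ('a', False), ('', True)]

# Wrap only the non-empty text parts with "@", as in the example above.
final_string = ''.join('@' + s if is_text and s else s for s, is_text in segments)
print(final_string)
# ac@bbca
```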