Replace strings with placeholders and restore them after a function.

alv*_*vas 6 python regex string replace placeholder

Given a string and a list of substrings that should be replaced with placeholders, e.g.

import re
from copy import copy 

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

The first goal is to replace the substrings from phrases in original_text with indexed placeholders, e.g.

text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[OUT]:

Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen

Then there will be some function that manipulates the text containing the placeholders, e.g.

cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)

[OUT]:

MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2

The last step is to do the replacement in reverse and put the original phrases back, i.e.

' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])

[OUT]:

"'s_morgen ik 's-Hertogenbosch depository_financial_institution"
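As a possible alternative to the token-by-token join above, the restoration could be done in a single regex pass with a callback. This is only a sketch: the name restore_phrases and the pattern MWEPHRASE\d+ are assumptions based on the placeholder format used in this question, and the mapping is written out by hand here so the snippet runs on its own.

```python
import re

# Mapping as built in the question (placeholder -> phrase with underscores).
backplacement = {
    "MWEPHRASE0": "'s_morgen",
    "MWEPHRASE1": "'s-Hertogenbosch",
    "MWEPHRASE2": "depository_financial_institution",
}

# One compiled pattern matches any placeholder token; \d+ is greedy, so
# MWEPHRASE10 is never mistaken for MWEPHRASE1 followed by "0".
PLACEHOLDER = re.compile(r"MWEPHRASE\d+")


def restore_phrases(text, mapping):
    """Swap every known placeholder back in one pass over the text."""
    # Placeholders missing from the mapping are left untouched.
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


cleaned_text = "MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2"
restored = restore_phrases(cleaned_text, backplacement)
print(restored)
```

Unlike the split-and-join version, this also restores a placeholder that is directly attached to punctuation, such as "MWEPHRASE0,".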

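The problems are:

  1. If the list of substrings in phrases is large, the first and last replacement steps take a very long time.

Is there a way to do the replacement with a regex?

  2. Substituting with re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) is not very helpful when a substring in phrases does not match a full word,

e.g.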

phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

We get an awkward output:

Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen

I have tried using '\b{}\b'.format(phrase), but that does not work for phrases with punctuation, i.e.

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[OUT]:

Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen

Is there some way to denote the word boundary for the phrases in the re.sub regex pattern?
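For context, \b only matches between a word character and a non-word character, so it cannot match in front of a phrase that starts with an apostrophe and follows a space. One way this might be approached is with the lookarounds (?<!\w) and (?!\w), combined with a single compiled alternation so the text is scanned once instead of once per phrase. The following is a sketch under assumptions: build_pattern and insert_placeholders are hypothetical helper names, and "not adjacent to a word character" is only one reading of what a phrase boundary should mean.

```python
import re


def build_pattern(phrases):
    """Compile one alternation that matches any phrase as a whole unit."""
    # Longest first, so a longer phrase wins over a shorter one that is its prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape keeps characters such as "-" or "." literal.
    alternation = "|".join(re.escape(p) for p in ordered)
    # (?<!\w) / (?!\w): no word character directly before / after the match.
    return re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))


def insert_placeholders(text, phrases):
    """Replace every phrase with an indexed placeholder in one pass."""
    index = {phrase: i for i, phrase in enumerate(phrases)}
    backplacement = {
        "MWEPHRASE{}".format(i): phrase.replace(" ", "_")
        for phrase, i in index.items()
    }
    pattern = build_pattern(phrases)
    new_text = pattern.sub(
        lambda m: "MWEPHRASE{}".format(index[m.group(0)]), text
    )
    return new_text, backplacement


phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = (
    "Something, 's morgen, ik 's-Hertogenbosch im das "
    "depository financial institution gehen"
)
text, backplacement = insert_placeholders(original_text, phrases)
print(text)
```

With the phrase list ["org"], the same function leaves "morgen" alone, because "org" is preceded by the word character "m".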

Дми*_*нко 2

You can split the string instead of using re.sub!

def do_something_with_str(string):
    # do something with string here.
    # for example let's wrap the string with "@" symbol if it's not empty
    return f"@{string}" if string else string


def get_replaced_list(string, words):
    result = [(string, True), ]

    # we take each word we want to replace
    for w in words:

        new_result = []

        # Getting each word in old result
        for r in result:

            # Now we split every string in results using our word.
            split_list = list((x, True) for x in r[0].split(w)) if r[1] else list([r, ])

            # If we replace successfully - add all the strings
            if len(split_list) > 1:

                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])

                # Add to new result
                new_result.extend(sub_result)

            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result


if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])

Here are the variable values for the example above:

initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True)]
modified_list = [('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'

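As you can see, the lists contain tuples. Each tuple holds two values: some string and a boolean that indicates whether it is text or a replaced value (True when it is text). Once you have the replaced list, you can modify it as in the example, checking whether an item is a text value (if x[1] == True). Hope that helps!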
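If a regex is acceptable after all, the same list of (string, is_text) tuples could probably be produced more compactly with re.split and a capturing group, which keeps the matched words in the result. This is a sketch and not a drop-in replacement for get_replaced_list: split_keep_words is a hypothetical name, words is assumed to be non-empty, and when the words overlap the alternation picks the leftmost match rather than processing the words one after another.

```python
import re


def split_keep_words(string, words):
    """Return (piece, is_text) tuples; is_text is False for matched words."""
    # A capturing group makes re.split keep the separators in its output.
    pattern = "({})".format("|".join(re.escape(w) for w in words))
    pieces = re.split(pattern, string)
    # With exactly one capturing group, text sits at even indices and
    # the matched words at odd indices.
    return [(piece, i % 2 == 0) for i, piece in enumerate(pieces)]


initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
replaced_list = split_keep_words(initial_string, ('a', 'c'))

# Same modification as in the example: prefix non-empty text pieces with "@".
final_string = ''.join(
    "@" + piece if is_text and piece else piece
    for piece, is_text in replaced_list
)
print(final_string)
```

For this input the tuples and the final string come out the same as the values listed above.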

P.S. The f-string formatting f"some string here {some_variable_here}" requires Python 3.6 or later.