Python regex escape operator \ in substitutions and raw strings

Ber*_*nes 8 python regex substitution backslash rawstring

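I don't understand how the escape operator \ works in Python regular expressions, nor the logic of the r' prefix for raw strings. Any help is appreciated.

Code: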

import re
text=' esto  .es  10  . er - 12 .23 with [  and.Other ] here is more ; puntuation'
print('text0=',text)
text1 = re.sub(r'(\s+)([;:\.\-])', r'\2', text)
text2 = re.sub(r'\s+\.', '\.', text)
text3 = re.sub(r'\s+\.', r'\.', text)
print('text1=',text1)
print('text2=',text2)
print('text3=',text3)

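The theory says: the backslash character ('\') either indicates a special form or allows special characters to be used without invoking their special meaning.

According to the link provided at the end of this question, r' denotes a raw string, i.e. symbols have no special meaning and are left unchanged.

So in the regexes above I would expect text2 and text3 to be different, since the substitution text in text2 is '\.', i.e. a period, whereas (in principle) the substitution text in text3 is r'\.', which is a raw string, i.e. the string exactly as it appears, backslash and period. But they produce the same result:

The result is: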

text0=  esto  .es  10  . er - 12 .23 with [  and.Other ] here is more ; puntuation
text1=  esto.es  10. er- 12.23 with [  and.Other ] here is more; puntuation
text2=  esto\.es  10\. er - 12\.23 with [  and.Other ] here is more ; puntuation
text3=  esto\.es  10\. er - 12\.23 with [  and.Other ] here is more ; puntuation
#text2=text3 but substitutions are not the same r'\.' vs '\.'

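It seems to me that neither r' nor the backslash works the same way in the substitution part. On the other hand, my intuition tells me I am missing something here.

EDIT 1: Following @Wiktor Stribiżew's comment. He pointed out (following his link):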

import re
print(re.sub(r'(.)(.)(.)(.)(.)(.)', 'a\6b', '123456'))
print(re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456'))
# in my example the substitutions were not the same, yet the results were equal
# here r' does indeed change the result

This gives:

ab
a6b

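This puzzles me even more.

Note: I read this very thorough Stack Overflow question about raw strings. However, it does not discuss substitutions.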

Wik*_*żew 6

First and foremost,

replacement patterns ≠ regular expression patterns

We use a regex pattern to search for matches, and a replacement pattern to replace the matches found with the regex.

NOTE: The only special character in a substitution pattern is a backslash, \. Only the backslash must be doubled.

Replacement pattern syntax in Python

The re.sub docs are confusing as they mention both string escape sequences that can be used in replacement patterns (like \n, \r) and regex escape sequences (\6) and those that can be used as both regex and string escape sequences (\&).

I am using the term regex escape sequence to denote an escape sequence consisting of a literal backslash + a character, that is, '\\X' or r'\X', and string escape sequence to denote a sequence of \ and a char (or several chars) that together form a valid string escape sequence. String escape sequences are only recognized in regular string literals. In raw string literals, a backslash can only escape the quote character, and even then the backslash remains part of the string (that is the reason why a raw string literal can't end with a single \).
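To make the distinction concrete, here is a minimal sketch of what Python stores for each kind of literal before re ever sees it. The variable names are only illustrative and are not part of the original answer.

```python
import re

newline = '\n'      # regular literal: \n is a string escape, one newline character
raw_n = r'\n'       # raw literal: two characters, backslash and 'n'
doubled = '\\n'     # regular literal with an escaped backslash: same two characters

print(len(newline), len(raw_n), len(doubled))  # 1 2 2
print(raw_n == doubled)                        # True

# In a raw literal the backslash before a quote stays in the string.
quoted = r'\''
print(len(quoted))                             # 2

# re.sub also understands \n in the replacement template, so both spellings
# end up inserting a newline, which is why the docs can look confusing.
print(re.sub('x', raw_n, 'axb') == re.sub('x', newline, 'axb'))  # True
```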

So, in a replacement pattern, you may use backreferences:

re.sub(r'\D(\d)\D', r'\1', 'a1b')    # => 1
re.sub(r'\D(\d)\D', '\\1', 'a1b')    # => 1
re.sub(r'\D(\d)\D', '\g<1>', 'a1b')  # => 1
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b') # => 1

You may see that r'\1' and '\\1' are the same replacement pattern, \1. If you use '\1', it will get parsed as a string escape sequence, a character with octal value 001. If you forget the r prefix with the unambiguous backreference \g<1>, there is no problem, because \g is not a valid string escape sequence, so the \ escape character remains in the string (recent Python versions do emit a warning for such invalid escapes in regular string literals, so the r prefix is still preferable). From the docs I linked to:

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.

So, when you pass '\.' as a replacement string, you actually send the two-character combination \. as the replacement string, and that is why you get \. in the result.
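The following sketch checks this for both cases from the question. The helper parse_literal is hypothetical: it evaluates the source text of a string literal through ast.literal_eval only so that the invalid-escape warning for '\.' can be silenced. It is not something the original answer uses.

```python
import ast
import re
import warnings

def parse_literal(source):
    """Evaluate Python string-literal source text, hiding the invalid-escape warning."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)

plain = parse_literal(r"'\.'")    # what the literal '\.' evaluates to
raw = parse_literal(r"r'\.'")     # what the literal r'\.' evaluates to
print(plain == raw, len(plain))   # True 2

# '\6' IS a recognized (octal) string escape, so here the two literals differ.
print(len('a\6b'), len(r'a\6b'))  # 3 4
print(re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456'))  # a6b
```

In the text2/text3 case both literals are the same two characters, so re.sub receives identical replacement strings. In the 'a\6b' case the non-raw literal already contains the control character with code 6 before re.sub runs, so no backreference is left to expand.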

\ is a special character in Python replacement pattern

If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in text2 and text3 cases, see this demo.

That happens because \\, two literal backslashes, denote a single backslash in the replacement pattern. If your regex pattern has no Group 2 but you pass r'\2' as the replacement, hoping to insert the literal two-character combination \ and 2, you will get an error, because \2 is parsed as a backreference to a nonexistent group.
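A short sketch of both points, using a shortened sample string rather than the full text from the question:

```python
import re

text = 'esto  .es  10  . er'

escaped = re.sub(r'\s+\.', r'\\.', text)   # \\ in the template -> one literal backslash
unknown = re.sub(r'\s+\.', r'\.', text)    # \. is not a template escape, kept as-is
print(escaped)             # esto\.es  10\. er
print(escaped == unknown)  # True

try:
    re.sub(r'(\s+)\.', r'\2', text)        # the pattern defines only one group
except re.error as exc:
    print('error:', exc)
```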

Thus, when you have dynamic, user-defined replacement patterns you need to double all backslashes in the replacement patterns that are meant to be passed as literal strings:

re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)
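As a minimal sketch of that idea, the doubling can be wrapped in a small helper. The name literal_replacement and the sample values are assumptions made for illustration only.

```python
import re

def literal_replacement(replacement):
    """Double every backslash so re.sub inserts the text verbatim."""
    return replacement.replace('\\', '\\\\')

user_text = r'C:\1\new'   # would otherwise be read as group 1 and a newline escape

result = re.sub(r'PATH', literal_replacement(user_text), 'dir=PATH')
print(result)             # dir=C:\1\new

# An alternative that is not part of the answer above: pass a function as the
# replacement. Its return value is inserted without template processing.
same = re.sub(r'PATH', lambda match: user_text, 'dir=PATH')
print(same == result)     # True
```

Without the doubling, this particular user_text would raise an error, because \1 refers to a group the pattern does not define.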