How to get a complete list of the character classes in Python regex, for quick lookup of a specific special character?

3 python regex

This Python document gives a complete list of the metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

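Similarly, is there a page that gives a complete list of the character classes?

I assume "character classes" in that document refers to a limited number of special characters of some kind, rather than all possible Unicode characters. Please correct me if necessary.

I searched, but did not find the canonical term.

If "character classes" does indeed refer to all possible Unicode characters, I would like to change my question to "a convenient way to look up regex special characters in Python".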

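It seems that regular-expressions.info calls them "shorthand character classes".

More positive examples (what I am looking for) are \d, \s, \S, \A and so on; negative examples (what I am not looking for) are abcdefghijklmnopqrstuvwxyz0123456789.

I searched the Python docs and Stack Overflow for "character classes" and "shorthand character classes", but did not find what I wanted.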

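Why do I need this? When I read a part of the docs such as

Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depends on whether ASCII or LOCALE mode is in force.

I would like to know what \w stands for. Searching the docs or Google costs me some time. For example, using Chrome's find command on that document, \w gets 41 results.

If there were a list of these characters, I could look everything up with no more than 2 searches (one for the lowercase letter and one for the uppercase letter).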

Ray*_*ger 5

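Categories visible from the shell

This code shows all of the "categories". The ones marked "IN" are character categories (the others mark specific slice points between characters):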

>>> from pprint import pprint
>>> import sre_parse

>>> pprint(sre_parse.CATEGORIES)
{'\\A': (AT, AT_BEGINNING_STRING),
 '\\B': (AT, AT_NON_BOUNDARY),
 '\\D': (IN, [(CATEGORY, CATEGORY_NOT_DIGIT)]),
 '\\S': (IN, [(CATEGORY, CATEGORY_NOT_SPACE)]),
 '\\W': (IN, [(CATEGORY, CATEGORY_NOT_WORD)]),
 '\\Z': (AT, AT_END_STRING),
 '\\b': (AT, AT_BOUNDARY),
 '\\d': (IN, [(CATEGORY, CATEGORY_DIGIT)]),
 '\\s': (IN, [(CATEGORY, CATEGORY_SPACE)]),
 '\\w': (IN, [(CATEGORY, CATEGORY_WORD)])}

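The entries with "CATEGORY" are the character categories.

This also answers the question of what \w stands for. It is a "word character". See also: In regex, what does \w* mean?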
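To get only the shorthand character classes rather than the whole table, the dictionary above can be filtered on the IN tag. This is a minimal sketch, not part of the original session: the helper name shorthand_classes is made up for illustration, and the import fallback assumes that Python 3.11 and later expose the same table in the private module re._parser (sre_parse is deprecated there). Both modules are undocumented internals and may change.

```python
# Python 3.11+ moved sre_parse to the private module re._parser;
# fall back to the old name on earlier versions.
try:
    import re._parser as parser
except ImportError:
    import sre_parse as parser


def shorthand_classes():
    """Return the escapes whose CATEGORIES entry is tagged IN (hypothetical helper)."""
    return sorted(
        escape
        for escape, (kind, _detail) in parser.CATEGORIES.items()
        if kind == parser.IN
    )


print(shorthand_classes())
```

The remaining keys of CATEGORIES (\A, \Z, \b, \B) are tagged AT, so they are positions rather than character classes and are filtered out.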

Categories explained in the docs

This is in the output of print(re.__doc__). It explains the intended meaning of each category:

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.
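For the "no more than 2 searches" use case in the question, the same docstring can be turned into a small lookup table. The following is only a sketch: special_sequences and ENTRY are names invented here, and the parsing assumes the layout shown above (an entry line indented by four spaces, continuation lines indented further), which is a formatting detail of re.__doc__ and not a documented interface.

```python
import re

# An entry line looks like: four spaces, a backslash sequence, then the text.
ENTRY = re.compile(r"^ {4}(\\\S+) +(\S.*)$")


def special_sequences(doc=None):
    """Map each special sequence in the docstring to its description."""
    doc = re.__doc__ if doc is None else doc
    table = {}
    current = None
    for line in (doc or "").splitlines():
        match = ENTRY.match(line)
        if match:
            # Start of a new entry such as "\d  Matches any decimal digit...".
            current = match.group(1)
            table[current] = match.group(2).strip()
        elif current is not None and line.startswith(" " * 5) and line.strip():
            # A more deeply indented line continues the current entry.
            table[current] += " " + line.strip()
        else:
            current = None
    return table


table = special_sequences()
for sequence in (r"\w", r"\W"):
    print(sequence, "->", table.get(sequence))
```

Looking up the lowercase and the uppercase letter then takes two dictionary accesses instead of paging through 41 search hits. If Python runs with -OO the docstring is stripped and the table is empty.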

Other special character groups

Beyond the shorthand character classes, the sre_parse module spells out other interesting groups of characters:

SPECIAL_CHARS = ".\\[{()*+?^$|"
REPEAT_CHARS = "*+?{"
DIGITS = frozenset("0123456789")
OCTDIGITS = frozenset("01234567")
HEXDIGITS = frozenset("0123456789abcdefABCDEF")
ASCIILETTERS = frozenset("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
WHITESPACE = frozenset(" \t\n\r\v\f")

ESCAPES = {
    r"\a": (LITERAL, ord("\a")),
    r"\b": (LITERAL, ord("\b")),
    r"\f": (LITERAL, ord("\f")),
    r"\n": (LITERAL, ord("\n")),
    r"\r": (LITERAL, ord("\r")),
    r"\t": (LITERAL, ord("\t")),
    r"\v": (LITERAL, ord("\v")),
    r"\\": (LITERAL, ord("\\"))
}

FLAGS = {
    # standard flags
    "i": SRE_FLAG_IGNORECASE,
    "L": SRE_FLAG_LOCALE,
    "m": SRE_FLAG_MULTILINE,
    "s": SRE_FLAG_DOTALL,
    "x": SRE_FLAG_VERBOSE,
    # extensions
    "a": SRE_FLAG_ASCII,
    "t": SRE_FLAG_TEMPLATE,
    "u": SRE_FLAG_UNICODE,
}
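The letters in FLAGS are the ones accepted by the inline (?...) syntax, so "a" corresponds to re.ASCII. A short sketch of how that flag changes what \w and \d match, as the docstring describes; the sample text and variable names are chosen here purely for illustration:

```python
import re

# "\u00ef" is a non-ASCII letter; "\u0663" is ARABIC-INDIC DIGIT THREE.
text = "na\u00efve 3\u0663"

# Default for str patterns: Unicode-aware \w and \d.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d", text)

# The ASCII flag, given as an argument or inline as (?a), narrows both.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
inline_ascii_words = re.findall(r"(?a)\w+", text)
ascii_digits = re.findall(r"(?a)\d", text)

print(unicode_words, unicode_digits)
print(ascii_words, ascii_digits)

# The inline letter ends up in the compiled pattern's flags.
print(bool(re.compile(r"(?a)\w").flags & re.ASCII))
```

With the ASCII flag, \w stops at the non-ASCII letter and \d ignores the non-ASCII digit, which is the [a-zA-Z0-9_] and [0-9] behaviour quoted from the docstring.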