Match any Unicode letter?

Question

mpe*_*pen 10 python regex character-properties

In .NET you can use \p{L} to match any letter. How can I do the same in Python? That is, I want to match any uppercase, lowercase, or accented letter.

Answer 1

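Python's re module doesn't support Unicode properties yet. But you can compile your regex with the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters, too.

Since \w also matches digits, you then need to subtract those from your character class, along with the underscore: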

[^\W\d_]

will match any Unicode letter.

>>> import re
>>> r = re.compile(r'[^\W\d_]', re.U)
>>> r.match('x')
<_sre.SRE_Match object at 0x0000000001DBCF38>
>>> r.match(u'é')
<_sre.SRE_Match object at 0x0000000002253030>
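
For Python 3 readers, here is a minimal sketch of the same idea. There, str patterns are Unicode-aware by default, so the re.UNICODE flag can be left out. The helper name letter_runs and the sample string are made up for illustration.

```python
import re

# [^\W\d_] = "a word character that is neither a digit nor an underscore".
# In Python 3, str patterns are Unicode-aware by default, so re.U is not needed.
LETTER_RUN = re.compile(r'[^\W\d_]+')

def letter_runs(text):
    """Return the runs of consecutive letters found in text (hypothetical helper)."""
    return LETTER_RUN.findall(text)

# '\u00e9' is a precomposed "é" (a single code point).
sample = 'caf\u00e9_42 Абв'
print(letter_runs(sample))  # ['café', 'Абв']
```

Note that this treats a base letter followed by a separate combining accent as two pieces, because combining marks are not word characters for \w.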

Answer 2

Wik*_*żew 5

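The PyPI regex module supports \p{L} Unicode property classes, and many more; see the "Unicode codepoint properties, including scripts and blocks" section in the documentation and the full list at http://www.unicode.org/Public/UNIDATA/PropList.txt. Using the regex module is convenient because you get consistent results in any Python version (mind that the Unicode standard is constantly evolving and the number of supported letters grows).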
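
To see which Unicode version a given interpreter ships with, the standard library exposes unicodedata.unidata_version. A small check, on the assumption (as far as I know) that the built-in re module relies on the same bundled database, which is why its results can differ between Python versions. The function name unicode_db_version is hypothetical.

```python
import sys
import unicodedata

def unicode_db_version():
    """Return the version of the Unicode database bundled with this interpreter."""
    return unicodedata.unidata_version

# The printed values depend on the interpreter you run this with.
print('Python', sys.version_info[:2], 'ships Unicode', unicode_db_version())
```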

Install the library using pip install regex (or pip3 install regex) and use

\p{L}        # To match any Unicode letter
\p{Lu}       # To match any uppercase Unicode letter
\p{Ll}       # To match any lowercase Unicode letter
\p{L}\p{M}*  # To match any Unicode letter and any amount of diacritics after it
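
If installing regex is not an option, the general categories behind these property classes can be inspected with the standard unicodedata module. This is a different technique from the patterns above: a rough sketch that approximates \p{L} and \p{L}\p{M}* character by character. The helper names is_letter and letter_clusters are made up for illustration.

```python
import unicodedata

def is_letter(ch):
    """True if ch is in a letter category (Lu, Ll, Lt, Lm, Lo), like \\p{L}."""
    return unicodedata.category(ch).startswith('L')

def letter_clusters(text):
    """Approximate \\p{L}\\p{M}*: each letter plus any combining marks right after it."""
    clusters = []
    open_cluster = False  # True while the previous character belongs to a cluster
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith('L'):
            clusters.append(ch)
            open_cluster = True
        elif cat.startswith('M') and open_cluster:
            clusters[-1] += ch
        else:
            open_cluster = False
    return clusters

# Categories: 'A' is Lu, 'a' is Ll, U+00E9 is Ll, U+0301 is Mn, '1' is Nd, '_' is Pc.
for ch in ['A', 'a', '\u00e9', '\u0301', '1', '_']:
    print(repr(ch), unicodedata.category(ch))

# A decomposed "é" (e + combining acute accent) stays together as one cluster.
print(letter_clusters('e\u0301-x'))
```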

See some usage examples below:

import regex
text = r'Abc-++-Абв. It’s “Абв”!'
# Removing letters:
print( regex.sub(r'\p{L}+', '', text) ) # => -++-. ’ “”!
# Extracting letter chunks:
print( regex.findall(r'\p{L}+', text) ) # => ['Abc', 'Абв', 'It', 's', 'Абв']
# Removing all but letters:
print( regex.sub(r'\P{L}+', '', text) ) # => AbcАбвItsАбв
# Removing all letters but ASCII letters:
print( regex.sub(r'[^\P{L}a-zA-Z]+', '', text) ) # => Abc-++-. It’s “”!

See the Python demo online.

Archived:	14 years, 6 months ago
Viewed:	5626 times
Last activity:	10 years, 10 months ago