The goal is to port this Perl regex (from here) to Python:
$norm_text =~ s/(\P{N})(\p{P})/$1 $2 /g;
First, I copied the character lists for \p{P} and \P{N} into readable text files, i.e.:
import requests
from six import text_type
n_url = 'https://raw.githubusercontent.com/alvations/charguana/master/charguana/data/perluniprops/Number.txt'
p_url = 'https://raw.githubusercontent.com/alvations/charguana/master/charguana/data/perluniprops/Punctuation.txt'
NUMS = text_type(requests.get(n_url).content.decode('utf8'))
PUNCTS = text_type(requests.get(p_url).content.decode('utf8'))
But when I try to compile the regex:
re.compile(u'([{n}])([{p}])'.format(n=NUMS, p=PUNCTS))
It throws this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 763, in _parse
    p = _parse_sub(source, state, sub_verbose)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 552, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range ~-- at position 217 (line 1, column 218)
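Looking around, the problem appears to be an unescaped dash inside the character set, as described in "Python regex bad character range".
It looks like there is a run of dash-like symbols: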
>>> NUMS[215:352]
'~----------------------------------------------------------------------------------------------------------------------------------------'
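The failure can be reproduced in isolation. The following is a minimal sketch, not taken from the downloaded files: the pattern `[~-+]` is a made-up example in which an unescaped dash sits between two characters whose code points are in descending order, so `re` reads it as an invalid range.

```python
import re

# Inside [...], "~-+" is parsed as a range from "~" (U+007E) to "+" (U+002B).
# The start sorts after the end, so compilation fails.
try:
    re.compile(u'[~-+]')
    message = None
except re.error as exc:
    message = str(exc)

print(message)

# Escaping the dash turns it into a literal member of the class.
ok = re.compile(u'[~\\-+]')
print(bool(ok.match(u'-')))
```

Any unescaped dash that ends up between two characters is read as a range operator, which is why moving individual runs of characters around keeps uncovering new errors.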
I then moved the dash characters to the front of the string, but there were more bad characters:
>>> NUMS2 = re.escape(NUMS[215:352]) + NUMS[:215] + NUMS[352:]
>>> re.compile(u'([{n}])'.format(n=NUMS2))
[out]:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 763, in _parse
    p = _parse_sub(source, state, sub_verbose)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 552, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range ¬-- at position 502 (line 1, column 503)
So I moved more characters to the front:
>>> NUMS2 = re.escape(NUMS[215:352]) + NUMS[:215] + NUMS[352:]
>>> NUMS3 = re.escape(NUMS2[500:504]) + NUMS2[:500] + NUMS2[504:]
>>> re.compile(u'([{n}])'.format(n=NUMS3))
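This looks like an endless cycle of discovering what counts as a "bad character range" in the regex.
Is there a way to automatically identify all the "bad characters" in the regex and move them to the front?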
The main problem here is that you need to escape the ^, -, ] and \ characters inside a character class.
Use
NUMS = re.sub(r'[]^\\-]', r'\\\g<0>', NUMS)
PUNCTS = re.sub(r'[]^\\-]', r'\\\g<0>', PUNCTS)
rx = re.compile(u'([{n}])([{p}])'.format(n=NUMS, p=PUNCTS))
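The r'[]^\\-]' pattern matches a single character, one of ], ^, \ or -, and the r'\\\g<0>' replacement replaces the match with a backslash followed by the matched value.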
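The escaping step can be tried without downloading anything. The sketch below rests on some assumptions: the short `NUMS` and `PUNCTS` strings are hypothetical stand-ins for the real files, `escape_for_class` is a helper name introduced only for illustration, and the first group uses a negated class `[^...]` on the assumption that the aim is to mirror Perl's `\P{N}` (any non-number), whereas the snippets above use a plain class.

```python
import re

# Hypothetical stand-ins for the downloaded NUMS and PUNCTS strings;
# the real files are much longer.
NUMS = u'0123456789'
PUNCTS = u'!~-^]\\.,'


def escape_for_class(chars):
    """Backslash-escape the four characters that are special inside [...]."""
    return re.sub(r'[]^\\-]', r'\\\g<0>', chars)


# Assumption: [^...] is used for the first group to mimic \P{N} (non-number).
rx = re.compile(u'([^{n}])([{p}])'.format(n=escape_for_class(NUMS),
                                          p=escape_for_class(PUNCTS)))

# Same replacement as the Perl s/(\P{N})(\p{P})/$1 $2 /g.
result = rx.sub(r'\1 \2 ', u'Hello, world-wide 3.14')
print(result)
```

With these stand-in strings, the period in 3.14 is left alone because the character before it is a digit, while the comma and the hyphen are padded with spaces.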