Regular expressions and Unicode in Python: difference between sub and findall

Question

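I'm having trouble tracking down a bug in a Python (2.7) script. re.sub and re.findall behave differently when it comes to recognizing special characters.

Here is the code: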

>>> import re
>>> re.sub(ur"[^-' ().,\w]+", '' , u'Castañeda', re.UNICODE)
u'Castaeda'
>>> re.findall(ur"[^-' ().,\w]+", u'Castañeda', re.UNICODE)
[]

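When I use findall, it correctly treats ñ as an alphabetic character, but when I use sub, it replaces it, treating it as a non-alphabetic character.

I have been able to get the correct behavior by combining findall with string.replace, but that seems like a poor solution. I would also like to use re.split, and there I run into the same problem as with re.sub.

Thanks in advance for your help.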

Answer 1

The call signature of re.sub is:

re.sub(pattern, repl, string, count=0)

So

re.sub(ur"[^-' ().,\w]+", '' , u'Castañeda', re.UNICODE)

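sets count to re.UNICODE, which has the value 32. No flag is actually applied, so \w matches only ASCII word characters and ñ gets stripped. re.findall(pattern, string, flags=0) has no count parameter, so there the third positional argument really is the flags, which is why your findall call behaved as expected.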
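To make the mix-up visible, here is a minimal sketch. The sample text is made up for illustration, and the literals use u"" with a doubled backslash instead of ur"" so that the snippet also runs on Python 3 (where str patterns are Unicode-aware by default). It assumes Python 2.7 or later, where re.sub also accepts a flags keyword argument.

```python
import re

pattern = u"[^-' ().,\\w]+"   # same character class as in the question
text = u"a!" * 40             # made-up input: 40 separate disallowed characters

# Same effect as the call in the question: re.UNICODE ends up in the count slot,
# so it only limits the number of replacements.
as_count = re.sub(pattern, u"", text, count=re.UNICODE)

# Passing the flag by keyword applies it as a flag and leaves count at 0 (no limit).
as_flag = re.sub(pattern, u"", text, flags=re.UNICODE)

print(int(re.UNICODE))        # 32
print(as_count.count(u"!"))   # 8  (only the first 32 matches were replaced)
print(as_flag.count(u"!"))    # 0  (every match was replaced)
```

Naming the argument (flags=...) avoids the positional mix-up altogether.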

Try this instead:

In [57]: re.sub(ur"(?u)[^-' ().,\w]+", '', u'Castañeda')
Out[57]: u'Casta\xf1eda'

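Placing (?u) at the beginning of the pattern is another way to specify the re.UNICODE flag, this time inside the regular expression itself. You can set other flags the same way, using the letters in (?iLmsux). (For details, search the documentation of the re module for "(?iLmsux)".)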
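A short sketch of inline flags, with made-up input. The (?u) prefix carries the flag inside the pattern, so nothing has to be passed in the count or flags position; (?i) is shown as a second inline flag. The literals use u"" rather than ur"" so the snippet also runs on Python 3, where (?u) is redundant but harmless for str patterns.

```python
import re

name = u"Casta\xf1eda"  # u'Castañeda'

# (?u) turns on re.UNICODE from within the pattern, so \w covers the n-tilde.
cleaned = re.sub(u"(?u)[^-' ().,\\w]+", u"", name + u"!?")
print(cleaned == name)  # True: only the trailing "!?" was removed

# (?i) is the inline form of re.IGNORECASE.
print(re.findall(u"(?i)casta", name) == [u"Casta"])  # True
```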

Similarly, the call signature of re.split is:

re.split(pattern, string, maxsplit=0)

Here a flag passed as the third positional argument lands in maxsplit, and the solution is the same.
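For re.split, the same fix might look like the sketch below. The input string and the ";" separator are made up for illustration, and the flags keyword is assumed to be available (Python 2.7 or later).

```python
import re

text = u"Casta\xf1eda;Mu\xf1oz"   # made-up input: two names separated by ';'

# Inline flag: nothing is passed in the maxsplit position.
parts_inline = re.split(u"(?u)[^-' ().,\\w]+", text)

# Keyword flag: equivalent, and maxsplit keeps its default of 0 (no limit).
parts_keyword = re.split(u"[^-' ().,\\w]+", text, flags=re.UNICODE)

print(parts_inline == [u"Casta\xf1eda", u"Mu\xf1oz"])  # True
print(parts_inline == parts_keyword)                   # True
```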