Treat an emoji as one character in a regex

Question


nai*_*eai 10 python regex python-2.7 python-unicode unicode-literals

Here is a small example:

reg = ur"((?P<initial>[+\-👍])(?P<rest>.+?))$"


(In both cases the file has -*- coding: utf-8 -*-.)

In Python 2:

re.match(reg, u"👍hello").groupdict()
# => {u'initial': u'\ud83d', u'rest': u'\udc4dhello'}
# unicode why must you do this


However, in Python 3:

re.match(reg, "👍hello").groupdict()
# => {'initial': '👍', 'rest': 'hello'}


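The Python 3 behaviour above is exactly what I want, but switching to Python 3 is currently not an option. What is the best way to replicate the Python 3 result in Python 2, in a way that works on both narrow and wide Python builds? The 👍 appears to reach me in the form "\ud83d\udc4d", which is what makes this tricky.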

Answer 1

Mar*_*nen 5

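In a Python 2 narrow build, a non-BMP character is stored as two surrogate code points, so you cannot use it correctly inside the [] syntax. u'[👍]' is equivalent to u'[\ud83d\udc4d]', which means "match one of \ud83d or \udc4d". Python 2.7 example:

>>> u'\U0001f44d' == u'\ud83d\udc4d' == u'👍'
True
>>> re.findall(u'[👍]',u'👍')
[u'\ud83d', u'\udc4d']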

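For reference, the two surrogate values come from the standard UTF-16 split of a code point above U+FFFF. The sketch below works through that arithmetic for U+1F44D; the function name to_surrogate_pair is made up for illustration and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F44D)
print(hex(high), hex(low))  # 0xd83d 0xdc4d
```

These are the same two values that a narrow build exposes as separate items inside a character class.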

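To fix this in both Python 2 and 3, match u'👍' OR [+-] using alternation instead of putting the emoji inside the character class. This returns the correct result in both Python 2 and 3: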

#coding:utf8
from __future__ import print_function
import re

# Note the 'ur' syntax is an error in Python 3, so properly
# escape backslashes in the regex if needed.  In this case,
# the backslash was unnecessary.
reg = u"((?P<initial>👍|[+-])(?P<rest>.+?))$"

tests = u'👍hello',u'-hello',u'+hello',u'\\hello'
for test in tests:
    m = re.match(reg,test)
    if m:
        print(test,m.groups())
    else:
        print(test,m)

Output (Python 2.7):

👍hello (u'\U0001f44dhello', u'\U0001f44d', u'hello')
-hello (u'-hello', u'-', u'hello')
+hello (u'+hello', u'+', u'hello')
\hello None

Output (Python 3.6):

👍hello ('👍hello', '👍', 'hello')
-hello ('-hello', '-', 'hello')
+hello ('+hello', '+', 'hello')
\hello None

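If the set of allowed leading characters grows, the same alternation idea can be generated rather than written by hand. The following is a minimal sketch, not part of the original answer: the helper name one_of is hypothetical, and it assumes every allowed item should be matched literally, so each one is passed through re.escape.

```python
import re

def one_of(items):
    """Build a non-capturing alternation that matches exactly one of the items."""
    # Longest items first, so a multi-unit item is tried before a shorter one.
    parts = sorted(set(items), key=len, reverse=True)
    return u"(?:" + u"|".join(re.escape(p) for p in parts) + u")"

# \U0001f44d is the thumbs-up emoji, written as an escape to keep the source ASCII.
initial = one_of([u"\U0001f44d", u"+", u"-"])
reg = re.compile(u"(?P<initial>" + initial + u")(?P<rest>.+?)$")

m = reg.match(u"\U0001f44dhello")
print(m.groupdict() if m else None)
```

Because the emoji sits in an alternation rather than a character class, a narrow build matches its two surrogates as a sequence, while a wide build or Python 3 matches the single code point.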
