How do I split a unicode string on a code point in Python (e.g. \u00B7 or \u2022)?

Question

ani*_*etd 0 python unicode split codepoint points

I have tried everything I could think of:

1. unicode_obj.split('\u2022')
2. re.split(r'\u2022', unicode_object)
3. re.split(r'(?iu)\u2022', unicode_object)

None of these work.

The problem is that I want to split on a special character.

example string : u'<special char like middot:\u00b7 or bullet:\u2022> sdfhsdf <repeat special char> sdfjhdgndujhfsgkljng <repeat special char> ... etc'

Please help.

Thanks in advance.

Answer 1

Jea*_*one 8

Consider:

>>> print '\u2022'
\u2022
>>> print len('\u2022')
6
>>> import unicodedata
>>> map(unicodedata.name, '\u2022'.decode('ascii'))
['REVERSE SOLIDUS', 'LATIN SMALL LETTER U', 'DIGIT TWO', 'DIGIT ZERO', 'DIGIT TWO', 'DIGIT TWO']
>>>

Compare with:

>>> print u'\u2022'
•
>>> print len(u'\u2022')
1
>>> map(unicodedata.name, u'\u2022')
['BULLET']
>>>

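This should make the difference between text.split('\u2022') and text.split(u'\u2022') clear: in Python 2 the first argument is a six-character byte string (backslash, u, 2, 0, 2, 2), while the second is the single BULLET character.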
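As a rough sketch of what the working split might look like, the snippet below is written so that it should run on Python 3, where every str literal is already unicode and the u prefix is optional. The sample strings and the variable names (text, parts, literal, mixed, pieces) are made up for illustration and are not from the question. A raw literal stands in for the Python 2 byte string, since it keeps the backslash in the same way.

```python
import re

# Hypothetical sample text; the bullet is written as an escape so the source stays ASCII.
text = u'\u2022 first item \u2022 second item \u2022 third item'

# u'\u2022' is one BULLET character, so split() finds it.
# Empty and whitespace-only pieces are dropped, the rest are stripped.
parts = [p.strip() for p in text.split(u'\u2022') if p.strip()]
print(parts)

# A raw literal keeps the backslash: six characters, none of them a bullet,
# so splitting on it leaves the text in one piece.
literal = r'\u2022'
print(len(literal))
print(text.split(literal) == [text])

# Splitting on either a middle dot or a bullet with a character class.
# The pattern is a normal (non-raw) literal, so the \u escapes are resolved
# by Python itself before the re module sees the pattern.
mixed = u'a \u00b7 b \u2022 c'
pieces = re.split(u'\\s*[\u00b7\u2022]\\s*', mixed)
print(pieces)
```

Building the pattern from a non-raw unicode literal avoids relying on the re module to interpret \u escapes, which as far as I know it only does on Python 3.3 and later.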
