Stripping Unicode character modifiers

Question

Rap*_*ael 7 python unicode utf-8

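What is the simplest way to strip character modifiers from a unicode string in Python?

For example:

A͋͠r͍̞̫̜͌ͦ̈͐t̼̭͞hu̡̙̞̘̙̬͖͓rͬͣ̐ͮͥͨ͏̣ should become Arthur

I looked through the documentation, but I could not find anything that does this.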

Answer 1

cge*_*cge 6

Try this:

import unicodedata
a = u"STRING GOES HERE" # using an actual string would break stackoverflow's code formatting.
u"".join( x for x in a if not unicodedata.category(x).startswith("M") )

This removes every character classified as a mark, which I think is what you want. In general, you can use unicodedata.category to get a character's category.
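For readers on Python 3, the same idea can be sketched as below. This is only an illustration, not the answerer's exact code: the function name `strip_marks` and the escaped sample string are assumptions made for the example, and the sample uses only a handful of the combining marks from the question.

```python
import unicodedata


def strip_marks(text):
    """Drop every character whose general category starts with 'M' (Mn, Mc, Me)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


# "Arthur" decorated with a few combining marks, written as escapes so the
# source stays readable (hypothetical sample, shorter than the question's).
decorated = "A\u034b\u0360r\u034d\u031et\u033chu\u0321r\u036c"

print(unicodedata.category("\u034b"))  # category of one combining mark
print(strip_marks(decorated))
```

Base letters such as "A" have a category starting with "L", so they pass through untouched; only the marks are filtered out.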

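+1. But it is better to use `.startswith('M')` here rather than ''M'. As of 6.1 there is no category with an "M" subcategory, but no rule says one cannot exist in the future. (3 upvotes)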

Answer 2

jfs*_*jfs 5

You can also use r'\p{M}', which the regex module supports:

import regex

def remove_marks(text):
    return regex.sub(ur"\p{M}+", "", text)

Example:

>>> print s
A??r????t???h?u????r????
>>> def remove_marks(text):
...     return regex.sub(ur"\p{M}+", "", text)
...     
... 
>>> print remove_marks(s)
Arthur

Depending on your use case, a whitelist approach may be better, e.g., restricting the input to ASCII characters only:

>>> s.encode('ascii', 'ignore').decode('ascii')
u'Arthur'

The result may depend on the Unicode normalization used in the text.
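To make the normalization point concrete, here is a small Python 3 sketch using only the standard library. The helper `strip_marks` and its `normalize` parameter are hypothetical names introduced for this example. It assumes that decomposing to NFD first is acceptable for your data, which may not hold for every script.

```python
import unicodedata


def strip_marks(text, normalize=True):
    """Remove mark characters, optionally decomposing to NFD first."""
    if normalize:
        # NFD splits precomposed letters into base letter + combining mark(s).
        text = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


precomposed = "caf\u00e9"   # "é" stored as a single code point
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Without normalization the precomposed accent survives, because there is
# no separate mark character to remove.
print(strip_marks(precomposed, normalize=False) == precomposed)
print(strip_marks(decomposed, normalize=False))
print(strip_marks(precomposed))

# The ASCII whitelist drops the whole precomposed letter, not just the accent.
print(precomposed.encode("ascii", "ignore").decode("ascii"))
print(decomposed.encode("ascii", "ignore").decode("ascii"))
```

The two inputs look identical on screen but produce different results unless the text is normalized first, which is why both the mark-stripping and the ASCII approach are sensitive to how the input was encoded.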
