Combined diacritics do not normalize with unicodedata.normalize (Python)

Question

Bem*_*mis 3 python unicode replace diacritics

I understand that unicodedata.normalize can be used to convert characters with diacritics to their non-diacritic counterparts:

import unicodedata
''.join( c for c in unicodedata.normalize('NFD', u'B\u0153uf') 
            if unicodedata.category(c) != 'Mn'
       )

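My question is (as this example shows): does unicodedata have a way to replace combined characters with their counterparts? (u'œ' becomes u'oe')

If not, I assume I will have to special-case these, but then I might as well compile my own dict of all such characters and their counterparts and forget about unicodedata altogether...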

Answer 1

Gar*_*ees 6

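The terminology in your question is a bit confused. A diacritic is a mark that can be added to a letter or other character but generally does not stand on its own. (Unicode also uses the more general term combining character.) What normalize('NFD', ...) does is convert precomposed characters into their components.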
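To make that concrete, here is a minimal sketch of what NFD does to a precomposed character, and why the snippet from the question leaves œ alone. The helper name `strip_marks` is my own, not part of unicodedata.

```python
import unicodedata

# 'é' (U+00E9) is a precomposed character: NFD splits it into a base
# letter followed by a combining mark (general category 'Mn').
decomposed = unicodedata.normalize('NFD', '\u00e9')
parts = [(hex(ord(c)), unicodedata.category(c)) for c in decomposed]
# parts == [('0x65', 'Ll'), ('0x301', 'Mn')]

def strip_marks(s):
    """Decompose with NFD, then drop the combining marks (category 'Mn')."""
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

stripped = strip_marks('cr\u00e8me br\u00fbl\u00e9e')  # 'creme brulee'
unchanged = strip_marks('B\u0153uf')                   # still 'B\u0153uf'
```

The last call returns its input untouched because œ has no decomposition for NFD to apply.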

Anyway, the answer is that œ is not a precomposed character. It's a typographic ligature:

>>> unicodedata.name(u'\u0153')
'LATIN SMALL LIGATURE OE'

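You can check this against the database yourself with unicodedata.decomposition. As a side note, and only as a sketch: some ligatures do carry a compatibility mapping, so the NFKD form splits those, but œ is not one of them. Bear in mind that NFKD also folds many other compatibility characters, which may be more than you want.

```python
import unicodedata

# œ (U+0153) has no decomposition mapping at all.
oe_mapping = unicodedata.decomposition('\u0153')   # ''

# ﬀ (U+FB00) has a compatibility mapping to two plain 'f' characters.
ff_mapping = unicodedata.decomposition('\ufb00')   # '<compat> 0066 0066'

# NFKD applies compatibility mappings: ĳ and ﬀ are split, œ is not.
nfkd = unicodedata.normalize('NFKD', 'B\u0153uf \u0132sselmeer \ufb00')
# nfkd == 'B\u0153uf IJsselmeer ff'
```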
The unicodedata module doesn't provide a method for splitting ligatures into their parts. But the data is there in the character names:

import re
import unicodedata

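# Anchored with `$` so that multi-word names such as
# 'LATIN SMALL LIGATURE LONG S T' are left alone rather than mis-split.
_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})$')

def split_ligatures(s):
    """
    Split the ligatures in `s` into their component letters.
    """
    def untie(l):
        # Pass a default: unicodedata.name() raises ValueError for
        # unnamed characters such as control characters.
        m = _ligature_re.match(unicodedata.name(l, ''))
        if not m:
            return l                    # not a ligature: keep as-is
        elif m.group(1):
            return m.group(2)           # CAPITAL ligature: upper-case letters
        else:
            return m.group(2).lower()   # SMALL ligature: lower-case letters
    return ''.join(untie(l) for l in s)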

>>> split_ligatures(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
u'Boeuf IJsselmeer ffotograff'

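(Of course you wouldn't do this in practice: you'd preprocess the Unicode database to generate a lookup table, as you suggest in your question. There aren't all that many ligatures in Unicode.)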
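One way that preprocessing might look, as a rough sketch rather than a definitive implementation: scan every code point once, keep those whose name matches the ligature pattern, and hand the resulting dict to str.translate. The names `build_ligature_table`, `LIGATURES` and `split_ligatures_fast` are hypothetical, and the `$` anchor (which skips multi-word names like 'LONG S T') is my own choice.

```python
import re
import sys
import unicodedata

# Matches names like 'LATIN SMALL LIGATURE OE'; group 1 is the case,
# group 2 the component letters.
_LIGATURE_NAME = re.compile(r'LATIN (CAPITAL|SMALL) LIGATURE ([A-Z]{2,})$')

def build_ligature_table():
    """Map each ligature's code point to its component letters."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        # The '' default avoids ValueError for unnamed code points.
        m = _LIGATURE_NAME.match(unicodedata.name(chr(cp), ''))
        if m:
            letters = m.group(2)
            table[cp] = letters if m.group(1) == 'CAPITAL' else letters.lower()
    return table

# Built once at import time; lookups afterwards are plain dict hits.
LIGATURES = build_ligature_table()

def split_ligatures_fast(s):
    """Replace ligatures in `s` using the precomputed table."""
    return s.translate(LIGATURES)

result = split_ligatures_fast('B\u0153uf \u0132sselmeer \ufb00otogra\ufb00')
# result == 'Boeuf IJsselmeer ffotograff'
```

The scan over all code points is the slow part, so in a real project you would probably run it once and paste the resulting dict into your source.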

Archived: 13 years, 5 months ago
Views: 696
Last activity: 13 years, 5 months ago