Why do 'é' and 'é' encode as different bytes?

Question


Ale*_*sen 5 python unicode normalization python-3.x

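The question

Why is the same character encoded as different bytes in different parts of my code base?

Context

I have a unit test that generates a temporary file tree and then checks that my scan actually finds the file in question.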

def test_unicode_file_name():
    test_regex = "é"
    file_tree = {"files": ["é"]}  # File created with Python's open()
    with TempTree(file_tree) as tmp_tree:
        import pdb; pdb.set_trace()
        result = tasks.find_files(test_regex, root_path=tmp_tree.root_path)
        expected = [os.path.join(tmp_tree.root_path, "é")]
        assert result == expected


The failing function

for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)

    if dir_entry.is_file():
        testing = dir_entry.name
        if filename_regex.match(testing):
            results.append(dir_entry.path)


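PDB session

When I started digging into this, I found that the test character (copied from my unit test) and the character in dir_entry.name encode as different bytes.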

(Pdb) testing
'é'
(Pdb) 'é'
'é'
(Pdb) testing == 'é'
False
(Pdb) testing in 'é'
False
(Pdb) type(testing)
<class 'str'>
(Pdb) type('é')
<class 'str'>
(Pdb) repr(testing)
"'é'"
(Pdb) repr('é')
"'é'"
(Pdb) 'é'.encode("utf-8")
b'\xc3\xa9'
(Pdb) testing.encode("utf-8")
b'e\xcc\x81'


Answer 1

Zer*_*eus 6

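Your operating system (macOS, at a guess) has converted the filename 'é' to Unicode Normalization Form D (NFD), decomposing it into an unaccented 'e' followed by a combining acute accent. You can see this clearly in a quick Python interpreter session: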

>>> import unicodedata
>>> e1 = b'\xc3\xa9'.decode()
>>> e2 = b'e\xcc\x81'.decode()
>>> [unicodedata.name(c) for c in e1]
['LATIN SMALL LETTER E WITH ACUTE']
>>> [unicodedata.name(c) for c in e2]
['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
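
The same difference can be reproduced without touching the filesystem. The following is a minimal sketch using only the standard library's unicodedata.normalize; the variable names composed and decomposed are illustrative and do not come from the question.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (NFC form)
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (NFD form)

# The two strings render identically but are different code point sequences.
assert composed != decomposed
assert (len(composed), len(decomposed)) == (1, 2)

# Normalizing converts one form into the other.
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# The UTF-8 encodings match the bytes seen in the PDB session.
assert composed.encode("utf-8") == b"\xc3\xa9"
assert decomposed.encode("utf-8") == b"e\xcc\x81"
```

Because str comparison in Python works on code points, not on rendered glyphs, the two forms compare unequal until one of them is normalized.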


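To make sure you are comparing like with like, you can convert the filename given by dir_entry.name back to Normalization Form C (NFC) before testing it against the regular expression: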

import unicodedata

for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)

    if dir_entry.is_file():
        testing = unicodedata.normalize('NFC', dir_entry.name)
        if filename_regex.match(testing):
            results.append(dir_entry.path)
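
If the pattern itself might also arrive in decomposed form (for example, when it is read from a file or pasted from another system), one option is to normalize both sides. This is only a sketch: nfc and match_names are hypothetical helpers, not part of the question's code, and normalizing a pattern is only safe for simple patterns, since composing characters inside something like a character class could change what the pattern means.

```python
import re
import unicodedata
from typing import Iterable, List


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def match_names(pattern: str, names: Iterable[str]) -> List[str]:
    """Return the original names whose NFC form matches the NFC pattern."""
    regex = re.compile(nfc(pattern))
    return [name for name in names if regex.match(nfc(name))]


# One decomposed name, one composed name, and one unrelated name.
names = ["e\u0301", "\u00e9", "plain.txt"]

# The pattern is given in decomposed form; both accented names still match.
matches = match_names("e\u0301", names)
```

The helper returns the names exactly as they were passed in, so the paths used to open the files afterwards still match what the filesystem reported.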


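NFC is more compact and is the form you will encounter most often (Apple's quirky filesystem decisions aside); for example, it is apparently what you get when you type 'é' in an editor. It is also the form [recommended by the W3C](https://www.w3.org/TR/charmod-norm/#choice-of-normalization-form), for what it's worth. (3 upvotes)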
