Unicode characters written to a file from Python I/O

Question

Wil*_*man 2 unicode utf-8 python-2.7

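I don't know whether this is a misunderstanding of UTF-8 or of Python on my part, but I can't figure out how Python writes Unicode characters to a file. I'm on a Mac running OS X, in case that makes a difference.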

Suppose I have the following unicode string:

    foo = u'\x93Stuff in smartquotes\x94\n'

Here \x93 and \x94 are those awful smart quotes.

Then I write it to a file:

    with open('file.txt', 'w') as file:
        file.write(foo.encode('utf8'))

When I open file.txt in a text editor such as TextWrangler, or in a web browser, it looks as though it was written as

    \xc2\x93Stuff in smartquotes\xc2\x94\n

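The text editor correctly detects that the file is UTF-8 encoded, but it renders \xc2\x93 as garbage. If I go in and manually strip out the \xc2 part, I get what I expect, and both TextWrangler and Firefox render the UTF characters as smart quotes.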

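This is exactly what I get when I read the file back into Python without decoding it as 'utf8'. However, when I read it in with read().decode('utf8'), I get back what I originally put in, without the \xc2 bits.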

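This is driving me nuts, because I'm trying to parse a bunch of HTML files into text, and the incorrect rendering of these Unicode characters is screwing up a bunch of stuff.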

I also tried this in Python 3 using the normal read/write methods, and it has the same behavior.

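Edit: regarding manually stripping out the \xc2, it turns out that it rendered correctly when I did that because the browser and the text editor default to a Latin encoding.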

Also, as a follow-up, Firefox renders the text as

    ☐Stuff in smartquotes☐

where the boxes are empty Unicode values, while Chrome renders the text as

    Stuff in smartquotes

Answer 1

Mar*_*nen 6

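The problem is that u'\x93' and u'\x94' are not the Unicode code points for smart quotes. They are the smart quotes in the Windows-1252 encoding, which is not the same as the latin1 encoding. In latin1, those values are not defined (as Unicode code points, U+0093 and U+0094 are unnamed C1 control characters).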

    >>> import unicodedata as ud
    >>> ud.name(u'\x93')
    Traceback (most recent call last):
      File "<interactive input>", line 1, in <module>
    ValueError: no such name
    >>> ud.name(u'\x94')
    Traceback (most recent call last):
      File "<interactive input>", line 1, in <module>
    ValueError: no such name
    >>> ud.name(u'\u201c')
    'LEFT DOUBLE QUOTATION MARK'
    >>> ud.name(u'\u201d')
    'RIGHT DOUBLE QUOTATION MARK'
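To see where the \xc2\x93 in the question's file comes from, it may help to compare how the byte 0x93 decodes under each codec and how the resulting code point encodes to UTF-8. The following is a minimal sketch written for Python 3 (an assumption, since the question is tagged python-2.7), using only standard-library codecs:

```python
import unicodedata

raw = b'\x93'  # the single byte Windows-1252 uses for a left smart quote

as_latin1 = raw.decode('latin1')  # U+0093, a C1 control character
as_cp1252 = raw.decode('cp1252')  # U+201C, LEFT DOUBLE QUOTATION MARK

print(hex(ord(as_latin1)), unicodedata.category(as_latin1))
print(hex(ord(as_cp1252)), unicodedata.name(as_cp1252))

# UTF-8 encodes U+0093 as two bytes: the sequence seen in file.txt.
wrong_utf8 = as_latin1.encode('utf-8')
# The real smart quote encodes to three bytes instead.
right_utf8 = as_cp1252.encode('utf-8')
print(wrong_utf8, right_utf8)
```

In other words, the file was valid UTF-8 all along; it simply contained the control character U+0093 rather than the quotation mark U+201C.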

So you should use one of the following instead:

    foo = u'\u201cStuff in smartquotes\u201d'
    foo = u'\N{LEFT DOUBLE QUOTATION MARK}Stuff in smartquotes\N{RIGHT DOUBLE QUOTATION MARK}'

Or, in a UTF-8 source file:

    # coding: utf8
    foo = u'“Stuff in smartquotes”'
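Once the string holds the real code points, the encoding step can be left to the file object. A minimal sketch, assuming Python 3 or Python 2.7's `io.open` (both accept an `encoding` argument); the temporary path is a stand-in for file.txt and is not part of the original question:

```python
import io
import os
import tempfile

foo = u'\u201cStuff in smartquotes\u201d'

# A temporary path stands in for file.txt so the sketch leaves nothing behind
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')

# The file object encodes the text on the way out; no manual .encode() call.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(foo)

# The raw bytes show the three-byte UTF-8 form of each smart quote.
with io.open(path, 'rb') as f:
    raw = f.read()

# Decoding with the same codec restores the original text.
with io.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()

print(repr(raw))
print(round_tripped == foo)
```

A file written this way should display proper quotation marks in any editor or browser that treats it as UTF-8.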

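Edit: If your Unicode string contains incorrect bytes, here is a way to fix them. The first 256 Unicode code points map 1:1 to the latin1 encoding, so latin1 can be used to encode an incorrectly decoded Unicode string directly back to a byte string, so that the correct decoding can then be applied: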

    >>> foo = u'\x93Stuff in smartquotes\x94'
    >>> foo
    u'\x93Stuff in smartquotes\x94'
    >>> foo = foo.encode('latin1').decode('windows-1252')
    >>> foo
    u'\u201cStuff in smartquotes\u201d'
    >>> print foo
    “Stuff in smartquotes”
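The 1:1 claim about latin1 can be checked directly. This is a small sketch in Python 3 syntax (an assumption; `bytes(range(256))` and `chr` behave differently in Python 2.7), and the variable names are illustrative only:

```python
# Every byte value 0..255 as one byte string.
all_bytes = bytes(range(256))

# latin1 maps byte N to code point N, with no gaps and no errors.
decoded = all_bytes.decode('latin1')
code_points = [ord(ch) for ch in decoded]

# Because the mapping is 1:1, encoding back is lossless.
restored = decoded.encode('latin1')

# Windows-1252 disagrees with latin1 only in the 0x80-0x9F range; bytes it
# leaves undefined are decoded with 'replace' so the comparison cannot raise.
differing = [b for b in range(256)
             if bytes([b]).decode('cp1252', 'replace') != chr(b)]

print(code_points == list(range(256)), restored == all_bytes)
print(hex(differing[0]), hex(differing[-1]), len(differing))
```

This is why the latin1 round trip is safe as a repair step: it recovers exactly the bytes that were originally misread, whatever they were.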

If you have the UTF-8-encoded version of the incorrect Unicode characters:

    >>> foo = '\xc2\x93Stuff in smartquotes\xc2\x94'
    >>> foo = foo.decode('utf8').encode('latin1').decode('windows-1252')
    >>> foo
    u'\u201cStuff in smartquotes\u201d'
    >>> print foo
    “Stuff in smartquotes”

And if the worst case is the following Unicode string:

    >>> foo = u'\xc2\x93Stuff in smartquotes\xc2\x94'
    >>> foo.encode('latin1') # back to a UTF-8 encoded byte string.
    '\xc2\x93Stuff in smartquotes\xc2\x94'
    >>> foo.encode('latin1').decode('utf8') # Undo the UTF-8, but Unicode is still wrong.
    u'\x93Stuff in smartquotes\x94'
    >>> foo.encode('latin1').decode('utf8').encode('latin1') # back to a byte string.
    '\x93Stuff in smartquotes\x94'
    >>> foo.encode('latin1').decode('utf8').encode('latin1').decode('windows-1252') # Now decode correctly.
    u'\u201cStuff in smartquotes\u201d'
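When many HTML files need the same treatment, the chains above could be wrapped in small helpers. The sketch below is written for Python 3, and both function names are hypothetical (they are not from any library). It assumes the damage is exactly one of the two patterns shown in this answer and returns the input unchanged when a step fails:

```python
def repair_cp1252_mojibake(text):
    """Hypothetical helper: fix text decoded as latin1 that was really Windows-1252."""
    try:
        return text.encode('latin1').decode('cp1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin1 (probably already
        # correct) or it holds a byte that cp1252 leaves undefined.
        return text


def repair_double_encoded(text):
    """Hypothetical helper for the worst case: UTF-8 bytes misread as latin1."""
    try:
        undone = text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
    # The UTF-8 layer is gone, but the C1 code points still need remapping.
    return repair_cp1252_mojibake(undone)


simple = repair_cp1252_mojibake('\x93Stuff in smartquotes\x94')
worst = repair_double_encoded('\xc2\x93Stuff in smartquotes\xc2\x94')
untouched = repair_cp1252_mojibake('\u201calready fine\u201d')
print(simple, worst, untouched)
```

Falling back to the original string is a design choice for batch processing; raising the exception instead would make unexpected input easier to notice.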

Archived:	10 years, 2 months ago
Views:	3,907
Last activity:	10 years, 2 months ago