Is unicode(codecs.BOM_UTF8, "utf8") necessary in Python 2.7/3?

Question

Bri*_*unt 6 python unicode byte-order-mark utf-8

During a code review, I came across the following code:

# Python bug that renders the unicode identifier (0xEF 0xBB 0xBF)
# as a character.
# If untreated, it can prevent the page from validating or rendering 
# properly. 
bom = unicode( codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')

This is in a function that passes a string to a Response object (Django or Flask).

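Is this still a bug in Python 2.7 or 3 that needs this fix? Something tells me it isn't, but I thought I'd ask because I don't understand the issue very well.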

I don't know where it comes from, but I've seen it around the internet, sometimes associated with Jinja2 (which we are using).

Thanks for reading.

Answer 1

ekh*_*oro 7

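The Unicode Standard states that the \ufeff character has two distinct meanings. At the beginning of a data stream it should be used as a byte-order and/or encoding signature, but elsewhere it should be interpreted as a zero-width non-breaking space.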

So the code

bom = unicode(codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')

does not just remove the utf-8 encoding signature (aka BOM); it also removes any embedded zero-width non-breaking spaces.
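
To make the difference concrete, here is a minimal Python 3 sketch. The sample string and the helper name `strip_leading_bom` are made up for illustration and are not part of the reviewed code. It contrasts the blanket `replace` with removing only a leading signature:

```python
import codecs

# In Python 3, str is already unicode, so decode the BOM bytes explicitly.
BOM = codecs.BOM_UTF8.decode("utf-8")  # the single code point U+FEFF


def strip_leading_bom(text):
    """Hypothetical helper: drop one U+FEFF only if it starts the text."""
    return text[1:] if text.startswith(BOM) else text


# Illustrative input: a leading signature plus an embedded U+FEFF.
sample = "\ufeffzero\ufeffwidth"

replaced = sample.replace(BOM, "")    # removes every U+FEFF
stripped = strip_leading_bom(sample)  # removes only the first one

print(repr(replaced))  # 'zerowidth'
print(repr(stripped))  # 'zero\ufeffwidth'
```

Whether the embedded character should survive depends on the data, so which of the two behaviours is wanted is a judgment call for the code that consumes the string.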

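Some earlier versions of python did not have a variant of the "utf-8" codec that skips the BOM when reading a data stream. As this was inconsistent with the other unicode codecs, a "utf-8-sig" codec was introduced in version 2.5, which does skip the BOM.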
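
A small Python 3 sketch of that codec difference follows. The byte string is an invented example, not data from the question:

```python
import codecs

# Invented example: UTF-8 bytes that start with the encoding signature.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

plain = data.decode("utf-8")        # keeps the signature as U+FEFF
skipped = data.decode("utf-8-sig")  # drops a leading signature

print(repr(plain))    # '\ufeffhello'
print(repr(skipped))  # 'hello'

# Without a signature, both codecs decode to the same text.
assert b"hello".decode("utf-8-sig") == b"hello".decode("utf-8")
```

If the stray character comes from files saved with a signature, decoding them with "utf-8-sig" at the point where they are read may remove the need for a later `replace`. That is an assumption about where the BOM originates, which the question does not confirm.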

So it's possible that the "Python bug" mentioned in the code comments relates to that.

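However, it seems more likely that the "bug" relates to embedded \ufeff characters. But since the Unicode Standard clearly states that they can be interpreted as legitimate characters, it is up to the consumer of the data to decide how to treat them, and so this is not a bug in python.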
