UTF-8 in Python on Windows

Question

tas*_*pai 5 python unicode utf-8 python-3.x

I have an HTML file that I need to read and parse. It is encoded as Unicode (that is what Notepad shows), but when I try

infile = open("path", "r") 
infile.read()


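it fails, and I get the famous error:

UnicodeEncodeError: 'charmap' codec can't encode character in position xx: character maps to <undefined>

So, as a test, I copied the file's contents into a new file, saved it as UTF-8, and then tried to open it with codecs like this: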

inFile = codecs.open("path", "r", encoding="utf-8")
outputStream = inFile.read()


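But I get this error message:

UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

I really don't understand this, because I created the file as UTF-8.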

Answer 1

jfs*_*jfs 6

UnicodeEncodeError indicates that the code fails while encoding Unicode text to bytes, i.e., your actual code tries to print to the Windows console. See Python, Unicode, and the Windows console.
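To illustrate the point, here is a minimal sketch of why printing fails and one way to sidestep it. It assumes the console uses a legacy code page such as cp1252; the helper name safe_for_console is hypothetical and not part of the original answer.

```python
def safe_for_console(text: str, encoding: str = "cp1252") -> str:
    """Return text with characters the target encoding cannot represent escaped.

    The default encoding is an assumption; a real console may use another code page.
    """
    # backslashreplace turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


if __name__ == "__main__":
    # U+FEFF (the BOM character) has no mapping in cp1252, so a plain
    # "\ufeff".encode("cp1252") would raise UnicodeEncodeError.
    print(safe_for_console("\ufeffhello"))
```

On Python 3.7 and later, calling sys.stdout.reconfigure(errors="backslashreplace") should have a similar effect for every subsequent print() call.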

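The link above fixes the UnicodeEncodeError. The next problem is finding out which character encoding the text in your "path" file uses. If notepad.exe shows the text correctly, then either it is encoded using locale.getpreferredencoding(False) (something like cp1252 on Windows) or the file has a BOM.

If you are sure that the encoding is utf-8, pass it to open() directly. Don't use codecs.open():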
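The two cases can be told apart by looking at the first bytes of the file. The following is only a sketch: guess_encoding is a hypothetical helper, and it cannot detect UTF-8 without a BOM, so it simply falls back to the locale encoding in that case.

```python
import codecs
import locale

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM starts with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw: bytes) -> str:
    """Return a codec name based on a leading BOM, else the locale encoding."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # No BOM: assume the legacy ANSI code page, as Notepad would.
    return locale.getpreferredencoding(False)
```

A possible use is to read the file in binary mode, call guess_encoding() on its first few bytes, and then reopen it with open(path, encoding=...).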

with open('path', encoding='utf-8') as file:
    html = file.read()
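If the file was saved as UTF-8 by Notepad, it most likely starts with a BOM, which is where the u'\ufeff' in the question's second error comes from. A hedged variant, assuming the file really is UTF-8, uses the standard utf-8-sig codec, which drops a leading BOM if one is present; the function name read_html is made up for illustration.

```python
def read_html(path: str) -> str:
    """Read a UTF-8 text file, stripping a leading BOM if there is one."""
    # "utf-8-sig" decodes like "utf-8" but skips an initial U+FEFF.
    with open(path, encoding="utf-8-sig") as file:
        return file.read()
```

Files without a BOM are read unchanged, so this codec works for both kinds of UTF-8 input.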


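Sometimes the input may contain text encoded using multiple (inconsistent) encodings, e.g., smart quotes may be encoded using cp1252 while the rest of the HTML is utf-8. You can fix that using bs4.UnicodeDammit. See also A good way to get the charset/encoding of an HTTP response in Python.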
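For readers without bs4 installed, the idea can be approximated with the standard library alone. This is not bs4.UnicodeDammit, only a rough sketch of the same idea: decode as UTF-8 and reinterpret any invalid byte as cp1252. The error-handler name cp1252_fallback and the function decode_mixed are invented for this example.

```python
import codecs


def _cp1252_fallback(error):
    """Codec error handler: decode bytes that are invalid UTF-8 as cp1252."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Bytes undefined in cp1252 (e.g. 0x81) become U+FFFD.
    return bad.decode("cp1252", errors="replace"), error.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def decode_mixed(data: bytes) -> str:
    """Decode mostly-UTF-8 bytes that contain stray cp1252 characters."""
    return data.decode("utf-8", errors="cp1252_fallback")
```

This heuristic can misread input whose cp1252 bytes happen to form valid UTF-8 sequences, so it is best treated as a last resort.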
