Posts by JWS*_*ott

'charmap' codec can't decode byte 0x8d in position 33222: character maps to <undefined>

I am trying to parse a very long HTML file with BeautifulSoup, using lxml as the parser. I know the file's character encoding is UTF-8 with BOM, but whenever I run contents = f.read() I get the following error:

'charmap' codec can't decode byte 0x8d in position 33222: character maps to <undefined>

Here is the first (and problematic) part of my code:

from bs4 import BeautifulSoup

with open("doc.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.h2)
    print(soup.head)
    print(soup.li)

Here is the error output:

    UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-4805460879e0> in <module>
      3 with open("doc.html", "r") as f:
      4 
----> 5     contents = f.read()
      6 
      7     soup = BeautifulSoup(contents, 'lxml')

~\Anaconda3\lib\encodings\cp1252.py in …
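The cp1252.py frame in the traceback suggests that open() fell back to the Windows default encoding, because no encoding argument was passed. Byte 0x8d has no mapping in cp1252, which would explain the error. A minimal sketch of one likely fix, assuming the file really is UTF-8 with a BOM as stated: pass encoding="utf-8-sig", which decodes UTF-8 and drops a leading BOM. The helper name read_html and the sample markup are made up for illustration; the temporary file stands in for doc.html.

```python
import codecs
import os
import tempfile


def read_html(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present."""
    # "utf-8-sig" decodes UTF-8 and skips the BOM; plain "utf-8" would
    # keep the BOM as a leading U+FEFF character in the returned string.
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


# Hypothetical stand-in for doc.html. U+010D encodes to the bytes C4 8D
# in UTF-8, so the file contains the same 0x8d byte named in the error.
sample = "<html><head><title>t</title></head><body><h2>Vi\u010de</h2></body></html>"
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + sample.encode("utf-8"))
    sample_path = tmp.name

contents = read_html(sample_path)
os.remove(sample_path)

print(contents)
```

The resulting string can then be passed to BeautifulSoup(contents, 'lxml') exactly as in the code above. If the file turns out not to be valid UTF-8, this call raises a UnicodeDecodeError of its own, which would point to a different actual encoding.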

html python encoding lxml beautifulsoup
