I am trying to parse a long HTML file with lxml through BeautifulSoup. I know the file's character encoding is UTF-8 with BOM, but whenever I run contents = f.read() I get the following error: 'charmap' codec can't decode byte 0x8d in position 33222: character maps to <undefined>
Here is the first (and problematic) part of my code:
from bs4 import BeautifulSoup

with open("doc.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.h2)
    print(soup.head)
    print(soup.li)
This is the error that is displayed:
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-4805460879e0> in <module>
      3 with open("doc.html", "r") as f:
      4
----> 5     contents = f.read()
      6
      7     soup = BeautifulSoup(contents, 'lxml')
~\Anaconda3\lib\encodings\cp1252.py in …
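
The traceback ends in cp1252.py, which suggests that open() fell back to the platform's default encoding rather than UTF-8. Below is a minimal sketch of one possible fix, assuming the file really is UTF-8 with BOM as stated above: pass encoding="utf-8-sig" to open(), so the bytes are decoded as UTF-8 and the leading BOM is dropped. The helper name read_html and the throwaway sample file are hypothetical and only stand in for doc.html.

```python
import tempfile
from pathlib import Path


def read_html(path, encoding="utf-8-sig"):
    """Hypothetical helper: read text with an explicit encoding
    instead of relying on the platform default (cp1252 in the traceback)."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Stand-in for doc.html: a UTF-8 BOM followed by U+010D,
# whose UTF-8 encoding is the byte pair C4 8D.
raw = b"\xef\xbb\xbf" + "<h2>\u010d</h2>".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "doc.html"
    sample.write_bytes(raw)
    contents = read_html(sample)

# cp1252 has no mapping for byte 0x8d, so decoding the same bytes
# with that codec fails in the same way as the reported error.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

print(repr(contents))
print(cp1252_failed)
```

With the text decoded this way, contents can be handed to BeautifulSoup(contents, 'lxml') exactly as in the original code. If the file turns out not to be valid UTF-8 after all, this call will raise a UnicodeDecodeError of its own, which would at least confirm that the assumed encoding is wrong.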