I want to build a search engine and am following a tutorial from a website. I want to test parsing HTML:
from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    with open(filename) as infile:
        html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
        d = {'text': html.pre.text}
        if html.title is not None:
            d['title'] = html.title.text
        for meta in html.findAll('meta'):
            try:
                if meta['name'] in ('Author', 'Title'):
                    d[meta['name'].lower()] = meta['content']
            except KeyError:
                continue
    return d

parse_html("C:\\pdf\\pydf\\data\\muellner2011.html")
It fails with this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 867: character maps to <undefined>
I have seen some solutions on the web that use encode(), but I don't know how to fit the encode() function into my code. Can anyone help me?
Mar*_*ers 48
In Python 3, files are opened in text mode (decoded to Unicode); you don't need to tell BeautifulSoup what codec to decode from.
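To make the text-mode point concrete, here is a small standard-library sketch. The file name `sample.html`, the sample content, and the helper `read_both_ways` are made up for illustration and are not part of the question. Reading in text mode yields a `str` that is already decoded, whereas binary mode yields the raw bytes:

```python
import os
import tempfile

# Hypothetical sample content; the curly quotes are non-ASCII characters.
SAMPLE = "<html><pre>\u201cquoted\u201d</pre></html>"

def read_both_ways(path):
    """Return the file content read in text mode and in binary mode."""
    with open(path, encoding="utf-8") as text_file:  # decoded to str
        as_text = text_file.read()
    with open(path, "rb") as binary_file:            # raw, undecoded bytes
        as_bytes = binary_file.read()
    return as_text, as_bytes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")
    with open(path, "w", encoding="utf-8") as out:
        out.write(SAMPLE)
    as_text, as_bytes = read_both_ways(path)

print(type(as_text).__name__, type(as_bytes).__name__)
```

Because the text-mode handle already hands over decoded text, any decoding error happens inside `open()`/`read()`, before BeautifulSoup sees the data.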
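If decoding the data fails, that's because you didn't tell the open() call what codec to use when reading the file; add the correct codec with the encoding argument: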
with open(filename, encoding='utf8') as infile:
    html = BeautifulSoup(infile, "html.parser")
Otherwise the file is opened with your system's default codec, which depends on the operating system.
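The sketch below shows one way to check that default and why byte 0x9d can trip it up. It assumes the file is really UTF-8 and that the system default is cp1252; the 'charmap' wording in the error message is consistent with that, but the question does not confirm it. The helper `try_decode` is hypothetical, introduced only for this illustration:

```python
import locale

# The codec open() normally falls back to when no encoding= is given.
default_codec = locale.getpreferredencoding(False)
print("default codec:", default_codec)

# U+201D (right double quotation mark) ends in byte 0x9d when UTF-8 encoded.
raw = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'

def try_decode(data, codec):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return f"failed: {exc}"

utf8_result = try_decode(raw, "utf-8")      # decodes cleanly
cp1252_result = try_decode(raw, "cp1252")   # 0x9d is unmapped in cp1252
print(cp1252_result)
```

If the printed default codec is not UTF-8, passing `encoding='utf8'` to `open()` explicitly, as in the answer, makes the result independent of the machine the script runs on.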