Saving HTML to a file for later use with Beautiful Soup

Question

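I do a lot of work with Beautiful Soup. However, my supervisor doesn't want me doing the work "live" from the web. Instead, he wants me to download all the text from a webpage first and process it afterwards. He wants to avoid hitting the website repeatedly.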

Here is my code:

import requests
from bs4 import BeautifulSoup

url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ' 
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

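I'm not sure whether I should save "page" as a file and then load that into Beautiful Soup, or save "soup" as a file to open later. I also don't know how to save it as a file so that it can be accessed just as if it were "live" from the internet. I know almost nothing about Python, so I need the simplest possible process.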

Answer 1

小智 6

Saving the soup would be... difficult, and it is beyond my experience (read up on the pickling process if you're interested). You can save the page like this:

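import requests

page = requests.get(url)  # url as defined in the question
# Write the raw bytes so the page's original encoding is preserved
with open('path/to/saving.html', 'wb') as f:
    f.write(page.content)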

Then, when you want to analyze it:

from bs4 import BeautifulSoup

with open('path/to/saving.html', 'rb') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

Something along those lines, anyway.
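
If the goal is to hit the site only once, the save and load steps above could be combined into a small "download once, then reuse" helper. This is just a minimal sketch, not anything built into requests or Beautiful Soup: the names get_html and fetch_with_requests, the fetch parameter, and the 30-second timeout are all made up for illustration.

```python
from pathlib import Path


def fetch_with_requests(url):
    """Download url and return the raw response bytes."""
    # Imported here so that reading an already-saved page does not need requests.
    import requests

    response = requests.get(url, timeout=30)  # timeout value is an arbitrary choice
    response.raise_for_status()  # fail loudly instead of saving an error page
    return response.content


def get_html(url, cache_path, fetch=fetch_with_requests):
    """Return the page bytes, downloading only if cache_path does not exist yet.

    get_html and its fetch parameter are hypothetical helpers for this sketch.
    """
    cache_file = Path(cache_path)
    if cache_file.exists():
        # Already downloaded: read the local copy and leave the website alone.
        return cache_file.read_bytes()
    html = fetch(url)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(html)
    return html
```

With that in place, soup = BeautifulSoup(get_html(url, 'path/to/saving.html'), 'lxml') should behave the same on every run, but only the first run touches the network. To force a fresh download, delete the saved file.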
