How to save a BeautifulSoup object to a file and then read it back as BeautifulSoup?

Nav*_*jum 3 python beautifulsoup

I want to save a BeautifulSoup object to a file. So I convert it to a string and write that string to a file. After reading it back as a string, I convert the string into a BeautifulSoup object again. This helps with my testing, because the data I am scraping is dynamic.

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://coinmarketcap.com/all/views/all/"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

Writing the soup object like this:

new_soup = str(soup)
with open("coin.txt", "w+") as f:
    f.write(new_soup)

produces this error:

UnicodeEncodeError: 'charmap' codec can't encode 
characters in position 28127-28132: character maps to <undefined>

Also, if I manage to save it to a file, how would I read the string back in as a BeautifulSoup object?
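
For reference, the UnicodeEncodeError above comes from writing with the platform's default codec (e.g. charmap/cp1252 on Windows) rather than from BeautifulSoup itself; a minimal sketch of the intended round trip, assuming UTF-8 output is acceptable:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://coinmarketcap.com/all/views/all/"
soup = BeautifulSoup(urlopen(url), "lxml")

# Write with an explicit encoding so the default charmap codec is never used
with open("coin.txt", "w", encoding="utf-8") as f:
    f.write(str(soup))

# Read the string back and re-parse it into a new BeautifulSoup object
with open("coin.txt", "r", encoding="utf-8") as f:
    new_soup = BeautifulSoup(f.read(), "lxml")

print(new_soup.title)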

Edg*_*gón 5

Edit

The old code could not pickle the soup object because of a RecursionError:

Traceback (most recent call last):
  File "soup.py", line 20, in <module>
    pickle.dump(soup, f)
RecursionError: maximum recursion depth exceeded while calling a Python object

The solution is to increase the recursion limit; they do the same thing in this answer, which in turn references the documentation.

However, the particular site you are trying to save and load is extremely nested. My machine cannot go beyond a recursion limit of 50000, which is not enough for your site, and it crashes with: 10008 segmentation fault (core dumped) python soup.py

So if you need to download the HTML and use it later, you can do this:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://coinmarketcap.com/all/views/all/"
html = urlopen(url)

# Save HTML to a file
with open("soup.html", "wb") as f:
    while True:
        chunk = html.read(1024)
        if not chunk:
            break
        f.write(chunk)
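
Equivalently, the chunked copy loop can be delegated to shutil.copyfileobj; a small sketch under the same assumptions:

import shutil
from urllib.request import urlopen

url = "https://coinmarketcap.com/all/views/all/"

# Stream the response straight to disk without holding it all in memory
with urlopen(url) as response, open("soup.html", "wb") as f:
    shutil.copyfileobj(response, f)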

Then you can read the saved HTML file back and instantiate a bs4 object from it:

# Read HTML from a file
with open("soup.html", "rb") as f:
    soup = BeautifulSoup(f.read(), "lxml")

print(soup.title)
# <title>All Cryptocurrencies | CoinMarketCap</title>
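
As a small variation, the BeautifulSoup constructor also accepts an open filehandle directly, so the explicit read() is optional:

from bs4 import BeautifulSoup

# Pass the open file object straight to the parser
with open("soup.html", "rb") as f:
    soup = BeautifulSoup(f, "lxml")

print(soup.title)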

Additionally, here is the code I used for a less nested site:

import pickle
from bs4 import BeautifulSoup
from urllib.request import urlopen
import sys

url = "/sf/ask/3708159031/"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

sys.setrecursionlimit(8000)

# Save the soup object to a file
with open("soup.pickle", "wb") as f:
    pickle.dump(soup, f)

# Read the soup object from a file
with open("soup.pickle", "rb") as f:
    soup_obj = pickle.load(f)

print(soup_obj.title)

# <title>python - How to save the BeautifulSoup object to a file and then read from it as BeautifulSoup? - Stack Overflow</title>.

  • When I run this code I get "RecursionError: maximum recursion depth exceeded while pickling an object", because the object is so large.