Puppeteer: how to download entire web page for offline use

Question

Coo*_*654 4 html javascript css web-scraping puppeteer

How would I scrape an entire web page, with all of its CSS/JavaScript/media intact (and not just its HTML), using Google's Puppeteer? Having used it successfully for other scraping jobs, I would expect it to be capable of this.

However, looking through the many excellent examples online, I see no obvious method for doing so. The closest I have been able to find is calling

const html_contents = await page.content();

and saving the result, but that produces a copy containing only the HTML, without any of the other resources.

Is there a way to save web pages for offline use with Puppeteer? I would appreciate any advice.

Answer 1

vse*_*byt 6

This is currently possible via the experimental CDP call 'Page.captureSnapshot', using the MHTML format:

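'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  let browser;
  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://en.wikipedia.org/wiki/MHTML');

    // Open a raw Chrome DevTools Protocol (CDP) session for this page.
    const cdp = await page.target().createCDPSession();
    // Experimental CDP method: returns the page and its resources as one MHTML string.
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);
  } catch (err) {
    console.error(err);
  } finally {
    // Close the browser even if navigation or the snapshot fails.
    if (browser) await browser.close();
  }
})();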
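
To check that the snapshot really contains more than the HTML, one option is to list the parts embedded in the MHTML file. The sketch below assumes the file is a standard MIME multipart document in which each part carries `Content-Type` and `Content-Location` headers. `listMhtmlParts` is a hypothetical helper written for illustration, not part of Puppeteer or the CDP.

```javascript
'use strict';

// Hypothetical helper (not a Puppeteer API): lists the resources
// embedded in an MHTML snapshot string.
function listMhtmlParts(mhtml) {
  // The boundary is declared in the top-level Content-Type header.
  const match = /boundary="?([^"\r\n;]+)"?/i.exec(mhtml);
  if (!match) {
    throw new Error('No multipart boundary found; is this MHTML?');
  }
  const delimiter = '--' + match[1];

  return mhtml
    .split(delimiter)
    .slice(1) // drop the top-level headers before the first delimiter
    .map((part) => {
      // The headers of each part end at its first blank line.
      const headers = part.split(/\r?\n\r?\n/)[0];
      const type = /^Content-Type:\s*([^;\r\n]+)/im.exec(headers);
      const location = /^Content-Location:\s*(\S+)/im.exec(headers);
      return type && location
        ? { type: type[1].trim(), location: location[1] }
        : null;
    })
    .filter(Boolean); // skips the remainder after the closing delimiter
}
```

With the file written by the script above, it could be called as `listMhtmlParts(fs.readFileSync('page.mhtml', 'utf8'))`. Entries with types such as `text/css` or `image/png` would indicate that stylesheets and images were captured along with the HTML.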

Archived: 7 years ago
Views: 943
Last recorded: 7 years ago