Puppeteer: saving a web page and its images

Question


Joh*_*sma 7 html javascript node.js web-scraping puppeteer

I'm trying to save a web page for offline use with Node.js and puppeteer. I see a lot of examples using:

await page.screenshot({path: 'example.png'});


But for larger web pages this is not an option. So a better option in puppeteer is to load the page and then save it like this:

const html = await page.content();
// ... write to file


OK, that works. Now I want to scroll the page, as you would on Twitter. So I decided to block all images in the puppeteer page:

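// Interception must be enabled before abort()/continue() can be called.
await page.setRequestInterception(true);
page.on('request', request => {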
    if (request.resourceType() === 'image') {
        const imgUrl = request.url()
        download(imgUrl, 'download').then((output) => {
            images.push({url: output.url, filename: output.filename})
        }).catch((err) => {
            console.log(err)
        })
        request.abort()
    } else {
        request.continue()
    }
})


OK, I now use the "npm download" library to download all the images. And yes, downloading the images works fine :D.

Now, when I save the content, I want the source to point to the offline images.

const html = await page.content();


But now I would like to replace all occurrences of:

<img src="/pic.png?id=123"> 
<img src="https://twitter.com/pics/1.png">


And also things like this:

<div style="background-image: url('this_also.gif')"></div>


So is there a way (in puppeteer) to scrape a large page and store the whole content offline?
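
The rewriting described above could be sketched roughly as follows. This is only an illustration, not a puppeteer feature: `rewriteReferences` and the `localFiles` map (absolute URL to saved filename) are hypothetical names, and the regular expressions only cover plain `src` attributes and `url(...)` values. An HTML parser would be more robust, and entity-encoded URLs (for example `&amp;` in a query string) are not decoded here.

```javascript
// Sketch: point <img src> and CSS url(...) references at local copies.
// `localFiles` is a hypothetical Map from absolute URL to saved filename.
function rewriteReferences(html, baseUrl, localFiles) {
  // Resolve a possibly relative reference and look up its local copy.
  const toLocal = (ref) => {
    try {
      const absolute = new URL(ref, baseUrl).href;
      return localFiles.get(absolute) || ref;
    } catch {
      return ref; // leave unparsable references untouched
    }
  };

  return html
    // src="..." or src='...' attributes
    .replace(/(\ssrc=)(["'])(.*?)\2/gi,
      (match, attr, quote, ref) => `${attr}${quote}${toLocal(ref)}${quote}`)
    // url(...) inside style attributes or <style> blocks
    .replace(/url\(\s*(["']?)(.*?)\1\s*\)/gi,
      (match, quote, ref) => `url(${quote}${toLocal(ref)}${quote})`);
}
```

References that have no entry in the map are left unchanged, so the saved page still falls back to the online resource for anything that was not downloaded.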

JavaScript and CSS would be nice too.

Update

For now I will open the large HTML file again with puppeteer.

And then intercept all files such as: https://dom.com/img/img.jpg , /file.jpg, ....

request.respond({
    status: 200,
    contentType: 'image/jpeg',
    body: '..'
});


I could also do this with a Chrome extension. But I would like a function with some options, something like page.html(), analogous to page.pdf().

Answer 1

小智 10

Going back to your first approach: you can use fullPage to capture the whole page in the screenshot.

await page.screenshot({path: 'example.png', fullPage: true});


If you really want to download all the resources for offline use, you can:

const fse = require('fs-extra');

page.on('response', async (res) => { // async, because the body uses await
    // save all the data to SOMEWHERE_TO_STORE
    await fse.outputFile(SOMEWHERE_TO_STORE, await res.buffer());
});

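SOMEWHERE_TO_STORE is left open above. One possible way to fill it in, as a minimal sketch, is to derive a file path from each response URL. The helper name `urlToLocalPath`, the `rootDir` parameter, and the choice to hash the query string are all assumptions made for illustration.

```javascript
const path = require('path');
const crypto = require('crypto');

// Hypothetical helper: derive a stable file path under `rootDir` for a URL.
function urlToLocalPath(url, rootDir) {
  const { hostname, pathname, search } = new URL(url);
  // Directory-style URLs get an explicit file name.
  let file = pathname.endsWith('/') ? pathname + 'index.html' : pathname;
  if (search) {
    // Keep query variants apart with a short hash placed before the extension.
    const hash = crypto.createHash('sha1').update(search).digest('hex').slice(0, 8);
    const ext = path.posix.extname(file);
    file = file.slice(0, file.length - ext.length) + '.' + hash + ext;
  }
  return path.join(rootDir, hostname, ...file.split('/').filter(Boolean));
}
```

Inside the response handler this could be used as `fse.outputFile(urlToLocalPath(res.url(), 'offline'), await res.buffer())`, where `'offline'` is an arbitrary directory name.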

Then you can browse the site offline through puppeteer, and everything works.

await page.setRequestInterception(true);
page.on('request', async (req) => { // async, because the body uses await
    // handle the request by responding data that you stored in SOMEWHERE_TO_STORE
    // and of course, don't forget THE_FILE_TYPE
    req.respond({
        status: 200,
        contentType: THE_FILE_TYPE,
        body: await fse.readFile(SOMEWHERE_TO_STORE),
    });
});

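THE_FILE_TYPE is also left open above. A simple way to choose it, assuming the stored files keep their original extensions, is a lookup by extension. The table below is deliberately small and `contentTypeFor` is a hypothetical helper; storing each response's `content-type` header at save time would be an alternative.

```javascript
const path = require('path');

// Small illustrative table; a real project would likely use a fuller MIME database.
const CONTENT_TYPES = {
  '.html': 'text/html',
  '.css': 'text/css',
  '.js': 'text/javascript',
  '.json': 'application/json',
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.gif': 'image/gif',
  '.svg': 'image/svg+xml',
  '.webp': 'image/webp',
};

// Map a stored file path to a Content-Type, with a generic binary fallback.
function contentTypeFor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return CONTENT_TYPES[ext] || 'application/octet-stream';
}
```

The result can be passed as `contentType` in `req.respond(...)`.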

Better to rely on the "requestfinished" event. (4 upvotes)
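
A rough sketch of what this comment suggests, under the assumption that the point is to read each body only after the request has completed. `saveFinishedRequests` and the `save(url, buffer)` callback are hypothetical names; `page` is a puppeteer Page.

```javascript
// Sketch: save bodies on 'requestfinished' rather than 'response'.
// `save(url, buffer)` is a hypothetical callback supplied by the caller.
function saveFinishedRequests(page, save) {
  page.on('requestfinished', async (req) => {
    const res = req.response();
    // Skip requests without a response and non-2xx statuses
    // (redirect responses, for example, have no body to read).
    if (!res || !res.ok()) return;
    try {
      await save(req.url(), await res.buffer());
    } catch (err) {
      console.error(`Could not save ${req.url()}:`, err.message);
    }
  });
}
```

The `save` callback could be something like `(url, buffer) => fse.outputFile(SOMEWHERE_TO_STORE, buffer)`, reusing the placeholder from the answer.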
