How do I combine puppeteer plugins with puppeteer-cluster?

Dag*_*nqx 4 javascript node.js puppeteer puppeteer-cluster

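I have a list of URLs that I need to scrape from a website built with React, so I am using Puppeteer. I don't want to be blocked by anti-bot servers, so I added puppeteer-extra-plugin-stealth. I want to stop ads from loading on the pages, so I block them with puppeteer-extra-plugin-adblocker. I also want to keep my IP address from being blacklisted, so I use TOR nodes to get different IP addresses. Below is a simplified version of my code. This setup works (TOR_port and webUrl are assigned dynamically, but to simplify my question I have assigned them as variables). There is one problem, though: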

const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

var TOR_port = 13931;
var webUrl ='https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';


// Wrapped in an async function so that `await` is valid in a CommonJS script
(async () => {
    const browser = await puppeteer.launch({
        dumpio: false,
        headless: false,
        args: [
            `--proxy-server=socks5://127.0.0.1:${TOR_port}`,
            `--no-sandbox`,
        ],
        ignoreHTTPSErrors: true,
    });

    try {
        const page = await browser.newPage();
        await page.setViewport({ width: 1280, height: 720 });
        await page.goto(webUrl, {
            waitUntil: 'load',
            timeout: 30000,
        });

        // Await the selector so that a timeout lands in the catch block below
        await page.waitForSelector('.price');
        console.log('The price is available');
    } catch (e) {
        // Navigation failed or the selector never appeared,
        // so this is probably not a zillow page
        console.error('This is not the zillow website:', e.message);
    } finally {
        // Close the browser on both the success and the failure path
        await browser.close();
    }
})();

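The setup above works but is very unreliable. I recently learned about Puppeteer-Cluster, and I need it to help me manage crawling multiple pages and keep track of my scraping tasks.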

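So my question is how to implement Puppeteer-Cluster with the setup above. I am aware of an example provided by the library (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) that shows how to use plugins, but it is so bare-bones that I don't quite understand it.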

How can I implement Puppeteer-Cluster with the TOR, AdBlocker, and Stealth configuration above?

小智 6

You can hand your puppeteer instance over to the cluster like this:

const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

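const { Cluster } = require('puppeteer-cluster');

// Cluster.launch accepts a `puppeteer` option, so pass it the
// plugin-enabled puppeteer-extra instance instead of calling puppeteer.launch
const cluster = await Cluster.launch({
    puppeteer,
});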

Source: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions
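
To connect this back to the question's TOR setup, here is a minimal sketch of how the launch flags from the question could be moved into the cluster's `puppeteerOptions`. The helper names `buildClusterOptions` and `scrapeAll`, the `maxConcurrency` value, and the choice of `CONCURRENCY_CONTEXT` are assumptions for illustration and do not come from the question or from the library's examples. `Cluster` and `puppeteer` are passed in as arguments so the sketch itself does not load any browser dependencies.

```javascript
// Build the options object for Cluster.launch.
// Assumption: the TOR SOCKS proxy listens on 127.0.0.1:<torPort>, as in the question.
function buildClusterOptions(Cluster, puppeteer, torPort) {
    return {
        puppeteer, // the puppeteer-extra instance with Stealth/Adblocker applied
        concurrency: Cluster.CONCURRENCY_CONTEXT, // one incognito context per worker
        maxConcurrency: 2, // illustrative value, tune for your machine
        puppeteerOptions: {
            headless: false,
            args: [
                `--proxy-server=socks5://127.0.0.1:${torPort}`,
                '--no-sandbox',
            ],
            ignoreHTTPSErrors: true,
        },
    };
}

// Hypothetical helper: queue every URL and wait until all tasks have finished.
async function scrapeAll(Cluster, puppeteer, torPort, urls) {
    const cluster = await Cluster.launch(
        buildClusterOptions(Cluster, puppeteer, torPort)
    );

    // Errors thrown inside a task are reported here instead of being lost
    cluster.on('taskerror', (err, data) => {
        console.error(`Failed to scrape ${data}: ${err.message}`);
    });

    // The same steps as the single-page version in the question
    await cluster.task(async ({ page, data: url }) => {
        await page.setViewport({ width: 1280, height: 720 });
        await page.goto(url, { waitUntil: 'load', timeout: 30000 });
        await page.waitForSelector('.price');
        console.log(`The price is available on ${url}`);
    });

    for (const url of urls) {
        cluster.queue(url);
    }

    await cluster.idle();  // resolves once the queue is empty
    await cluster.close(); // shuts down the browser(s)
}
```

To try it, call `scrapeAll(Cluster, puppeteer, TOR_port, [webUrl])` from an async function, where `Cluster` comes from `require('puppeteer-cluster')` and `puppeteer` is the puppeteer-extra instance configured with `puppeteer.use(...)` as shown in the answer. Note that in this sketch every worker shares the same proxy port, so all pages leave through the same TOR circuit.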