Rob*_*dru 2 javascript web-scraping headless-browser puppeteer
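I'm having some trouble with Puppeteer. I want to extract a list of items, and it works when headless is false but not when it is true.
First, I want to get the elements before mapping over them.
Here is my script; you should be able to reproduce the problem with it, it's very basic.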
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
const searchTerm = "Apple";
const searchUrl = baseUrl + searchTerm;
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    ignoreHTTPSErrors: true,
    args: [`--window-size=1920,1080`],
    defaultViewport: {
      width: 1920,
      height: 1080,
    },
  });
  const page = await browser.newPage();
  // Begin navigation
  console.log(chalk.yellow("Beginning navigation."));
  await page.goto(searchUrl);
  // Await list of elements
  console.log(chalk.yellow("Wait for Network Idle..."));
  await page.waitForNetworkIdle();
  // Get items
  const findElements = await page.evaluate(() => {
    const elements = document.querySelectorAll(".sale-item");
    console.log(elements);
    return elements;
  });
  console.log(findElements);
  console.log(chalk.blue("Waiting..."));
  await page.waitForTimeout(10000);
  await browser.close();
  console.log(chalk.red("Closed."));
})();
Expected results : {
'0': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'1': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'2': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'3': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'4': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
.
.
}
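For starters, I prefer page.waitForSelector(yourSelector) to page.waitForNetworkIdle(). In most cases it is a more direct guarantee that the data you want is on the page, whereas waiting for network idle can block on all sorts of requests that have nothing to do with the data you're trying to scrape. Another option is page.waitForResponse(predicate).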
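The page.waitForResponse(predicate) option can be sketched roughly as follows. This is a minimal sketch, not code from the answer: the helper names makeUrlPredicate and gotoAndWaitForResponse are hypothetical, and the URL fragment you pass in is an assumption you would need to confirm by looking at the site's requests in the browser's network tab.

```javascript
// Hypothetical helper: builds a predicate for page.waitForResponse().
// It matches the first successful (2xx) response whose URL contains
// the given fragment.
function makeUrlPredicate(urlFragment) {
  return (response) => response.url().includes(urlFragment) && response.ok();
}

// Hypothetical helper: navigates and waits for the matching response.
// The wait is registered together with the navigation so that a fast
// response cannot arrive before we start listening for it.
async function gotoAndWaitForResponse(page, url, urlFragment, timeout = 30000) {
  const [response] = await Promise.all([
    page.waitForResponse(makeUrlPredicate(urlFragment), { timeout }),
    page.goto(url, { waitUntil: "domcontentloaded" }),
  ]);
  return response;
}
```

With a real Puppeteer page, gotoAndWaitForResponse(page, searchUrl, fragment) would resolve as soon as the one request you care about has completed, regardless of whatever unrelated requests are still in flight.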
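Some sites check request headers to block scrapers. You can try changing the user agent header, as described in the Puppeteer GitHub issue "Different behavior between { headless: false } and { headless: true } #665":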
const puppeteer = require("puppeteer"); // ^19.6.3
const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
const searchTerm = "Apple";
const searchUrl = baseUrl + encodeURIComponent(searchTerm);
let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36";
  await page.setUserAgent(ua);
  await page.goto(searchUrl, {waitUntil: "domcontentloaded"});
  await page.waitForSelector(".sale-item");
  const elements = await page.$$(".sale-item");
  console.log(elements.length); // => 48
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
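As a variation on the hard-coded user agent string above, you could derive one from the browser's own user agent. This is only a sketch under an assumption: headless Chrome has traditionally identified itself with a "HeadlessChrome" token in its user agent, which may not hold for every Chrome version or headless mode. The helper names stripHeadlessToken and useRegularUserAgent are hypothetical.

```javascript
// Hypothetical helper: replaces the "HeadlessChrome" token (assumed to be
// present in headless mode) with "Chrome". A user agent without the token
// is returned unchanged.
function stripHeadlessToken(userAgent) {
  return userAgent.replace("HeadlessChrome", "Chrome");
}

// Hypothetical helper: reads the browser's own user agent, removes the
// headless token, and applies the result to the page before navigation.
async function useRegularUserAgent(browser, page) {
  const ua = stripHeadlessToken(await browser.userAgent());
  await page.setUserAgent(ua);
  return ua;
}
```

Calling useRegularUserAgent(browser, page) before page.goto() keeps the version number in the header in step with the Chrome build Puppeteer actually launched, rather than pinning it to an old release.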
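Using puppeteer-extra, as described in "Why does headless need to be false for Puppeteer to work?", is another option you can try. Among other things, it also uses a random browser user agent header.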
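A rough sketch of the puppeteer-extra route might look like the following. It assumes the puppeteer-extra and puppeteer-extra-plugin-stealth packages are installed; the stealth plugin is an assumption here, since the answer names only puppeteer-extra. The helper names createStealthPuppeteer and countItems are hypothetical.

```javascript
// Hypothetical helper: loads puppeteer-extra and registers the stealth
// plugin (assumed to be installed). The requires are deferred so that
// nothing is loaded until the helper is actually called.
function createStealthPuppeteer() {
  const puppeteer = require("puppeteer-extra");
  const StealthPlugin = require("puppeteer-extra-plugin-stealth");
  puppeteer.use(StealthPlugin());
  return puppeteer;
}

// Hypothetical helper: launches a browser with the given puppeteer-like
// object, waits for the selector, and returns how many elements match.
// The browser is closed even if navigation or the wait fails.
async function countItems(puppeteer, url, selector) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    await page.waitForSelector(selector);
    const elements = await page.$$(selector);
    return elements.length;
  } finally {
    await browser.close();
  }
}
```

With both packages installed, countItems(createStealthPuppeteer(), searchUrl, ".sale-item") would run the same check as the script above in headless mode. Whether it gets past this particular site's checks is something you would have to try.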