Dav*_*ars 3 pagination node.js web-scraping puppeteer
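I'm building a basic web scraper with Puppeteer. So far I can return all the data I need from any given page, but when pagination is involved my scraper comes unstuck: it only returns the first page.
See the example below: it returns the title and price of the first 20 books, but doesn't look at the books on the other 49 pages.
I'm just looking for guidance on how to get past this; I can't see anything in the docs.
Thanks!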
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('http://books.toscrape.com/');

  const result = await page.evaluate(() => {
    let data = [];
    let elements = document.querySelectorAll('.product_pod');
    for (var element of elements){
      let title = element.childNodes[5].innerText;
      let price = element.childNodes[7].children[0].innerText;
      data.push({title, price});
    }
    return data;
  });

  browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value);
});
To be clear: I'm following a tutorial here. This code is from Brandon Morelli on codeburst.io! https://codeburst.io/a-guide-to-automating-scraping-the-web-with-javascript-chrome-puppeteer-node-js-b18efb9e9921
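I was following the same article to teach myself how to use Puppeteer. The short answer to your question is that you need to introduce a loop that iterates over all the available pages of the online book catalogue. I took the following steps to collect all the book titles and prices:
I extracted the page.evaluate part into a separate async function that takes the page as an argument. The rest is exactly the same code as in Brandon Morelli's article, but now with one extra loop: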
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('http://books.toscrape.com/');

  var results = []; // holds the collection of all book titles and prices
  var lastPageNumber = 50; // hardcoded last catalogue page; you can set it dynamically if you wish

  // simple loop to iterate over the catalogue pages
  for (let index = 0; index < lastPageNumber; index++) {
    // wait 1 second for the page to load
    // (a plain timer works in every Puppeteer version)
    await new Promise((resolve) => setTimeout(resolve, 1000));
    // call and await extractedEvaluateCall, concatenating the results on every iteration.
    // results.push would also work, but would leave a collection of collections at the end
    results = results.concat(await extractedEvaluateCall(page));
    // click the "next" button to jump to the following page
    if (index != lastPageNumber - 1) {
      // there is no "next" button on the last page
      await page.click('#default > div > div > div > div > section > div:nth-child(2) > div > ul > li.next > a');
    }
  }

  await browser.close();
  return results;
};

// the same extraction logic as before, moved into a separate async function
// that takes the page as an argument
async function extractedEvaluateCall(page) {
  return page.evaluate(() => {
    let data = [];
    let elements = document.querySelectorAll('.product_pod');
    for (var element of elements) {
      let title = element.childNodes[5].innerText;
      let price = element.childNodes[7].children[0].innerText;
      data.push({ title, price });
    }
    return data;
  });
}

scrape().then((value) => {
  console.log(value);
  console.log('Collection length: ' + value.length);
  console.log(value[0]);
  console.log(value[value.length - 1]);
});
Console output:
...
{ title: 'In the Country We ...', price: '£22.00' },
... 900 more items ]
Collection length: 1000
{ title: 'A Light in the ...', price: '£51.77' }
{ title: '1,000 Places to See ...', price: '£26.08' }
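
If you would rather not hardcode the last page number, one option is to keep following the "next" link until it is no longer on the page. The sketch below is only an illustration under a few assumptions, not code from the article: `scrapeAllPages`, `extractBooks` and `maxPages` are hypothetical names, the shorter selector `li.next > a` is assumed to match the same link as the long selector used above, and `page` is assumed to be a Puppeteer page (or any object exposing the same `$`, `click` and `waitForNavigation` methods).

```javascript
// Sketch: collect items from every catalogue page by following the
// "next" link until no such link is left on the page.
// `page` is expected to behave like a Puppeteer Page (assumption).
async function scrapeAllPages(page, extractBooks, maxPages = 1000) {
  const nextSelector = 'li.next > a'; // assumed selector for the "next" link
  let results = [];

  // maxPages is a hypothetical safety cap against looping forever.
  for (let visited = 0; visited < maxPages; visited++) {
    // Collect the items on the current page.
    results = results.concat(await extractBooks(page));

    // page.$ resolves to null when nothing matches, i.e. on the last page.
    const nextLink = await page.$(nextSelector);
    if (nextLink === null) {
      break;
    }

    // Start waiting for the navigation before clicking, so that the
    // navigation triggered by the click cannot be missed.
    await Promise.all([
      page.waitForNavigation(),
      page.click(nextSelector),
    ]);
  }

  return results;
}

// Possible use inside scrape(), after page.goto(...):
//   const results = await scrapeAllPages(page, extractedEvaluateCall);
```

Waiting for the navigation instead of sleeping for a fixed second means the loop moves on as soon as the next page has loaded, and the null check replaces the `index != lastPageNumber - 1` condition.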