Apify can crawl links from a sitemap.xml:
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ requestsFromUrl: 'https://edition.cnn.com/sitemaps/cnn/news.xml' }],
    });
    await requestList.initialize();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);
            await Apify.pushData({
                url: request.url,
                title: await page.title(),
                html: await page.content(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
https://sdk.apify.com/docs/examples/puppeteersitemap#docsNav
However, if I use a requestQueue, I am not sure how to crawl the links from sitemap.xml. For example:
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({ url: 'https://google.com' });

// This is not working. Apify simply crawls sitemap.xml
// and does not add the URLs from sitemap.xml to the requestQueue.
await requestQueue.addRequest({ url: 'https://google.com/sitemap.xml' });
const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    // This function is called for every page the crawler visits.
    handlePageFunction: async (context) => {
        const { request, page } = context;
        const title = await page.title();
        const pageUrl = request.url;
        console.log(`Title of ${pageUrl}: ${title}`);
        // Note: pseudoUrls was undefined here; pass null to enqueue all links.
        await Apify.utils.enqueueLinks({
            page, selector: 'a', pseudoUrls: null, requestQueue,
        });
    },
});
await crawler.run();
The great thing about Apify is that you can use a RequestList and a RequestQueue at the same time. In that case, items are taken from the list into the queue as you crawl (without overloading the queue). By using both, you get the best of both worlds.
Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ requestsFromUrl: 'https://edition.cnn.com/sitemaps/cnn/news.xml' }],
    });
    await requestList.initialize();

    const requestQueue = await Apify.openRequestQueue();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);

            // This is just an example, define your own logic.
            await Apify.utils.enqueueLinks({
                page, selector: 'a', pseudoUrls: null, requestQueue,
            });

            await Apify.pushData({
                url: request.url,
                title: await page.title(),
                html: await page.content(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
If you want to use only the queue, you will need to parse the XML yourself. Of course, that is not a big problem: you can parse it easily with Cheerio, either before the crawler starts or inside an Apify.CheerioCrawler.
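As a rough illustration of the "parse it yourself" approach, here is a minimal sketch that pulls the URLs out of a sitemap's `<loc>` tags with a plain regex (no Cheerio dependency) and feeds them into the queue. The regex, the helper name `extractSitemapUrls`, and the commented usage lines are my own assumptions, not part of the Apify API:

```javascript
// Extract the <loc> URLs from a sitemap XML string.
// A simple regex is enough for standard sitemaps; for anything more
// complex, use a real XML parser such as Cheerio in xmlMode.
function extractSitemapUrls(xml) {
    const urls = [];
    const locRegex = /<loc>\s*([^<]+?)\s*<\/loc>/g;
    let match;
    while ((match = locRegex.exec(xml)) !== null) {
        urls.push(match[1]);
    }
    return urls;
}

// Hypothetical usage before the crawler starts (Apify SDK v0.x style):
// const requestQueue = await Apify.openRequestQueue();
// const { body } = await Apify.utils.requestAsBrowser({ url: sitemapUrl });
// for (const url of extractSitemapUrls(body)) {
//     await requestQueue.addRequest({ url });
// }
```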
In any case, we recommend using a RequestList for bulk URLs, because it is created in memory essentially instantly, whereas the queue is actually a database (or a JSON file when running locally).