Apify can crawl links from a sitemap.xml:
const Apify = require('apify');

Apify.main(async () => {
    // Load the list of URLs to crawl directly from the sitemap.
    const requestList = new Apify.RequestList({
        sources: [{ requestsFromUrl: 'https://edition.cnn.com/sitemaps/cnn/news.xml' }],
    });
    await requestList.initialize();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        // Called once for every page the crawler opens.
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);
            // Store the URL, title and full HTML in the default dataset.
            await Apify.pushData({
                url: request.url,
                title: await page.title(),
                html: await page.content(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
https://sdk.apify.com/docs/examples/puppeteersitemap#docsNav
However, if I use a requestQueue, I'm not sure how to crawl the links from sitemap.xml. For example:
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({ url: "https://google.com" });
//this is not working. Apify is simply crawling sitemap.xml
//and not adding …
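One possible way to get sitemap links into a request queue is to pull the `<loc>` entries out of the sitemap yourself and add each one as its own request. The following is only a minimal sketch under several assumptions: the sitemap XML has already been downloaded into a string, a simple regular expression is good enough to find the `<loc>` elements (a real XML parser would be more robust), and `extractSitemapUrls`, `decodeXmlEntities` and `enqueueSitemapUrls` are hypothetical helper names, not part of the Apify SDK. The only thing the sketch relies on from the queue is the `addRequest({ url })` method used above.

```javascript
// Hypothetical helper: decode the five predefined XML entities.
// '&amp;' is decoded last so that '&amp;lt;' does not turn into '<'.
function decodeXmlEntities(text) {
    return text
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&quot;/g, '"')
        .replace(/&apos;/g, "'")
        .replace(/&amp;/g, '&');
}

// Hypothetical helper: return every URL found inside a <loc> element.
// Assumes plain <loc>…</loc> elements without CDATA sections.
function extractSitemapUrls(xml) {
    const urls = [];
    for (const match of xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)) {
        urls.push(decodeXmlEntities(match[1]));
    }
    return urls;
}

// Hypothetical helper: add each sitemap URL to a queue-like object.
// `requestQueue` only needs an async addRequest({ url }) method,
// such as the object returned by Apify.openRequestQueue().
async function enqueueSitemapUrls(requestQueue, xml) {
    const urls = extractSitemapUrls(xml);
    for (const url of urls) {
        await requestQueue.addRequest({ url });
    }
    return urls.length;
}
```

With helpers like these, the sitemap itself never has to be added to the queue: you would download it once, call `enqueueSitemapUrls(requestQueue, xml)`, and then pass the same `requestQueue` to the crawler.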