Posts by Sau*_*abh

Scraping URLs from sitemap.xml using Apify Puppeteer and requestQueue

Apify can scrape links from a sitemap.xml:

const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ requestsFromUrl: 'https://edition.cnn.com/sitemaps/cnn/news.xml' }],
    });
    await requestList.initialize();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);
            await Apify.pushData({
                url: request.url,
                title: await page.title(),
                html: await page.content(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});

https://sdk.apify.com/docs/examples/puppeteersitemap#docsNav

However, if I use a requestQueue, I'm not sure how to scrape the links from sitemap.xml. For example:

const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({ url: "https://google.com" });

 //this is not working. Apify is simply crawling sitemap.xml 
 //and not adding …
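One possible way to approach the question is to read the sitemap yourself, pull out each `<loc>` entry, and add those URLs to the queue one by one. The sketch below is an illustration under stated assumptions, not the SDK's own mechanism: `extractSitemapUrls` and `enqueueSitemapUrls` are hypothetical helper names, the regex assumes a flat sitemap without CDATA sections or nested sitemap indexes, and the queue is only assumed to expose `addRequest({ url })` as in the snippet above.

```javascript
// Hypothetical helper: collect the text of every <loc> element in sitemap XML.
// Assumes a flat sitemap; nested sitemap indexes and CDATA are not handled.
function extractSitemapUrls(xml) {
    const urls = [];
    const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
    let match;
    while ((match = locPattern.exec(xml)) !== null) {
        // Sitemaps XML-escape ampersands, so restore them in the URL.
        urls.push(match[1].replace(/&amp;/g, '&'));
    }
    return urls;
}

// Hypothetical helper: add each sitemap URL to a queue-like object
// that exposes addRequest({ url }), then report how many were processed.
async function enqueueSitemapUrls(requestQueue, xml) {
    const urls = extractSitemapUrls(xml);
    for (const url of urls) {
        await requestQueue.addRequest({ url });
    }
    return urls.length;
}
```

Inside `Apify.main`, the XML text would first have to be downloaded (with any HTTP client), passed to `enqueueSitemapUrls` together with the queue from `Apify.openRequestQueue()`, and the queue then handed to the crawler via its `requestQueue` option. As far as I know, `PuppeteerCrawler` also accepts a `requestList` and a `requestQueue` at the same time, so keeping the `requestsFromUrl` list from the first example and adding a queue next to it may be the simpler route.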

puppeteer apify

3 votes, 1 answer, 1926 views
