How can I crawl in the desired order, or synchronously, in Scrapy?

Bus*_*cio · python · scrapy

Problem

I am trying to create a spider that crawls a store and scrapes every product, outputting the results to a JSON file. It should enter each category from the main page and scrape every product (name and price); each category page uses infinite scrolling.

My problem is that every time I issue a request after scraping the first page of one category, instead of getting the next batch of items of the same type, I get items from the next category, so the output ends up a mess.

What I have already tried

I have already tried messing with the settings, forcing concurrent requests down to one, and setting a different priority for each request.
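For reference, the settings change looked roughly like this (a sketch of a settings.py fragment; note that CONCURRENT_REQUESTS = 1 serializes downloads but does not control which queued request the scheduler picks next):

# settings.py (sketch): force one request at a time
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
# Slow things down further if needed
DOWNLOAD_DELAY = 0.5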

I found information about asynchronous crawling, but I do not know how to create the requests in order.

import scrapy
from scrapper_pccom.items import ScrapperPccomItem

class PccomSpider(scrapy.Spider):
    name = 'pccom'
    allowed_domains = ['pccomponentes.com']
    start_urls = ['https://www.pccomponentes.com/componentes']

    #Scrapes links for every category from main page
    def parse(self, response):
        categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')
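        # Higher-priority requests are dequeued first in Scrapy, so the
        # first category on the page gets the largest priority value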
        prio = 20
        for category in categories:
            url = response.urljoin(category.extract())
            yield scrapy.Request(url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})
            prio = prio - 1

    #Scrapes products from every page of each category      
    def parse_item_list(self, response, prio):

        products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
        for product in products:
            item = ScrapperPccomItem()
            item['name'] = product.xpath('@data-name').extract()
            item['price'] = product.xpath('@data-price').extract()
            yield item

        #URL of the next page
        next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})
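For reference, the ScrapperPccomItem imported above is not shown in the question; presumably it is just a two-field item along these lines:

import scrapy

class ScrapperPccomItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()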

Actual vs. expected output

What it does: Cat 1 page 1 > Cat 2 page 1 > Cat 3 page 1 > ...

What I want it to do: Cat 1 page 1 > Cat 1 page 2 > Cat 1 page 3 > ... > Cat 2 page 1

Answer (Uma*_*air)

This is easy.

Get a list of all the category links in all_categories. Then, instead of crawling every link at once, crawl only the first category link; once all pages of that category have been scraped, send a request for the next category link.

Here is the code. I have not run it, so there may be some syntax errors, but the logic is what you need:

import scrapy
from scrapper_pccom.items import ScrapperPccomItem

class PccomSpider(scrapy.Spider):
    name = 'pccom'
    allowed_domains = ['pccomponentes.com']
    start_urls = ['https://www.pccomponentes.com/componentes']

    all_categories = []

    def yield_category(self):
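        # Note: list.pop() takes URLs from the END of self.all_categories,
        # so categories are crawled in reverse order; pop(0) would keep
        # the order they appear on the page.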
        if self.all_categories:
            url = self.all_categories.pop()
            print("Scraping category %s " % (url))
            return scrapy.Request(url, self.parse_item_list)
        else:
            print("all done")


    #Scrapes links for every category from main page
    def parse(self, response):
        categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')

        self.all_categories = list(response.urljoin(category.extract()) for category in categories)
        yield self.yield_category()


    #Scrapes products from every page of each category
    def parse_item_list(self, response):

        products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
        for product in products:
            item = ScrapperPccomItem()
            item['name'] = product.xpath('@data-name').extract()
            item['price'] = product.xpath('@data-price').extract()
            yield item

        #URL of the next page
        next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, self.parse_item_list)

        else:
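            # yield_category() returns None once every category is done;
            # Scrapy ignores a None yielded from a callback, so this is safe.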
            print("All pages of this category scraped, now scraping next category")
            yield self.yield_category()
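If it helps, one way to run the spider and dump the items to JSON (a sketch; it assumes Scrapy 2.1+ for the FEEDS setting, and scrapy crawl pccom -o products.json from the CLI works just as well):

from scrapy.crawler import CrawlerProcess

# Run the spider in-process and write every scraped item to products.json
process = CrawlerProcess(settings={
    "FEEDS": {"products.json": {"format": "json"}},
})
process.crawl(PccomSpider)
process.start()  # blocks until the crawl finishes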