I want to use Scrapy (with cb_kwargs) to combine information from multiple pages into one item

K_M*_*_MM 0 python web-crawler scrapy web-scraping

Goal: I want to retrieve the order history data published on a particular e-commerce site. Because the data for each order is spread across multiple pages, I want to extract the information from each page and finally combine it into a single item (record).


I went through the official documentation and other similar Q&As and found a few relevant ones. From those, I understood that this can be achieved by using cb_kwargs; a minimal sketch of how I understand that pattern follows the links below. However, I cannot figure out what is wrong with my code below.

  • [python - Explanation of callbacks and cb_kwargs with scrapy - Stack Overflow]
  • [python - Multiple pages per item in Scrapy](/sf/ask/1554131351/?noredirect=1&lq=1)
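As far as I understand it, the cb_kwargs pattern from those answers boils down to something like the following minimal sketch (the spider name, URLs, XPaths and field names here are placeholders of my own, not taken from the actual site):

import scrapy

class SketchSpider(scrapy.Spider):
    name = 'cb_kwargs_sketch'
    start_urls = ['https://example.com/list.html']  # placeholder URL

    def parse(self, response):
        # collect the data that is only shown on the list page
        registered_date = response.xpath('//p[@class="date"]/text()').get()
        detail_url = response.xpath('//a[@class="detail"]/@href').get()
        # pass the partial data on to the callback that handles the detail page
        yield response.follow(
            detail_url,
            callback=self.parse_detail_page,
            cb_kwargs={'registered_date': registered_date},
        )

    def parse_detail_page(self, response, registered_date):
        # whatever was put into cb_kwargs arrives here as a keyword argument
        yield {
            'registered_date': registered_date,
            'brand_name': response.xpath('//dd/a/text()').get(),
        }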

The program runs, but nothing is written to the CSV, as shown below.
(screenshot: empty CSV output)


Each order results page lists information for 30 items. I want to first retrieve each item's registration date (which appears only on the results page), then follow the link to each product page to collect the details, and finally store the combined information one item at a time, roughly as in the sketch below.
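Put differently, the flow I am aiming for per results page looks something like this sketch (the XPaths are copied from my code further down and may well be wrong, which is part of what I am asking about):

def parse_firstpage_item(self, response):
    # one <ul> block per order on the results page (about 30 per page)
    for row in response.xpath('//*[@id="buyeritemtable"]/div/ul'):
        conversion_date = row.xpath('li[2]/p[3]/text()').get()
        product_url = row.xpath('li[2]/p[1]/a/@href').get()
        # carry the results-page data along to that order's product page
        yield response.follow(
            product_url,
            callback=self.parse_productpage_item,
            cb_kwargs={'conversion_date': conversion_date},
        )

def parse_productpage_item(self, response, conversion_date):
    # combine results-page data and product-page data into one record
    yield {
        'Conversion_date': conversion_date,
        'brand_name': response.xpath('normalize-space(//*[@id="s_brand"]/dd/a/text())').get(),
        'inquire': response.xpath('//*[@id="tabmenu_inqcnt"]/text()').get(),
        'page_URL': response.url,
    }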


I am a beginner who started writing Python about three months ago, so I may be missing some basics about classes and the like. I would appreciate it if you could point that out along the way. The official Scrapy documentation is not very beginner-friendly, and I have had a hard time working with it.

def parse_firstpage_item(self, response):
    request = scrapy.Request(
        url=response.url,
        callback=self.parse_productpage_item,
        cb_kwargs=dict(product_URL='//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a'))

    loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)

    loader.add_xpath("Conversion_date", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()')

    yield loader.load_item()

def parse_productpage_item(self, response, product_URL):

    loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)

    loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')

    loader.add_value("page_URL", response.url)
    loader.add_xpath("inquire", '//*[@id="tabmenu_inqcnt"]/text()')

    yield loader.load_item()
class MyLinkExtractor(LinkExtractor):
    def extract_links(self, response):
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [
                subdoc
                for x in self.restrict_xpaths
                for subdoc in response.xpath(x)
            ]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        logging.info('=' * 100)
        logging.info(all_links)
        logging.info(f'total links len: {len(all_links)}')
        logging.info('=' * 100)
        return all_links


class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata'
    allowed_domains = ['www.buyma.com']
    # start_urls = ['https://www.buyma.com/buyer/9887867/sales_1.html']

    rules = (
        Rule(MyLinkExtractor(
            restrict_xpaths='//*[@class="buyeritem_name"]/a'), callback='parse_firstpage_item', follow=False),
        Rule(LinkExtractor(restrict_xpaths='//DIV[@class="pager"]/DIV/A[contains(text(),"次")]'), follow=False),
    )

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)]
                    # if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def start_requests(self):
        with open('/Users/morni/buyma_researchtool/buyma_researchtool/AllshoppersURL.csv', 'r', encoding='utf-8') as f:
            reader = csv.reader(f)
            header = next(reader)
            for row in reader:
                yield scrapy.Request(url=str(row[2])[:-5] + '/sales_1.html')
            for row in self.reader:
                for n in range(1, 300):
                    url = f'{self.base_page}{row}/sales_{n}.html'
                    yield scrapy.Request(
                        url=url,
                        callback=self.parse_firstpage_item,
                        errback=self.errback_httpbin,
                        dont_filter=True
                    )

    def parse_firstpage_item(self, response):

        loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)

        loader.add_xpath("Conversion_date", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()')
        loader.add_xpath("product_name", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/text()')
        loader.add_value("product_URL", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href')
        item = loader.load_item()

        yield scrapy.Request(
            url=response.urljoin(item['product_URL']),
            callback=self.parse_productpage_item,
            cb_kwargs={'item': item},
        )

    def parse_productpage_item(self, response, item):

        loader = ItemLoader(item=item, response=response)

        loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
        〜

        yield loader.load_item()

gan*_*ass 5

You need to request each page in turn and pass the current item along to the callback:

def parse_first_page(self, response): 
    loader = ItemLoader(item = BuymaResearchtoolItem(), response = response)
    loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
    loader.add_value("page_URL" , response.url) 
    loader.add_xpath("inquire" , '//*[@id="tabmenu_inqcnt"]/text()')
    item = loader.load_item()

    yield scrapy.Request(
        url=second_page_url,
        callback=self.parse_second_page,
        cb_kwargs={'item': item},
    )

def parse_second_page(self, response, item): 
    loader = ItemLoader(item=item, response=response)
    loader.add_xpath("Conversion_date", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()')
    item = loader.load_item()

    yield scrapy.Request(
        url=third_page_url,
        callback=self.parse_third_page,
        cb_kwargs={'item': item},
    )

def parse_third_page(self, response, item): 
    loader = ItemLoader(item=item, response=response)
    loader.add_value('ThirdUrl', response.url)
    yield loader.load_item()
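Here second_page_url and third_page_url are placeholders for whatever links lead to the follow-up pages; in this question's case they would be the product-page links extracted from the order list (built, for example, with response.urljoin or response.follow). Because cb_kwargs travels with each individual request, each order's partially filled item stays attached to its own product-page request, which is what makes the per-item aggregation work.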