K_M*_*_MM | tags: python, web-crawler, scrapy, web-scraping
Goal: I want to collect the order performance data published on a particular e-commerce site. Because the data for each order is spread across several pages, I want to extract the information from each page and finally combine it into a single item (one record per order).
I have read the official documentation and some similar Q&As. From those I understand that this can be done with cb_kwargs, but I cannot figure out what is wrong with the code below.
Related questions I found:
- python - Explain callbacks and cb_kwargs with scrapy - Stack Overflow
- python - Multiple pages per item in Scrapy (/sf/ask/1554131351/?noredirect=1&lq=1)
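For context, this is the basic pattern I took away from those answers, as I understand it: build part of the item in one callback and hand it to the next request via cb_kwargs. This is only a minimal sketch; the spider name, URL, and selectors are placeholders, not code for the actual site:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Minimal illustration of cb_kwargs; name, URL and selectors are placeholders.
    name = "cb_kwargs_example"
    start_urls = ["https://example.com/list"]

    def parse(self, response):
        # Collect what is only visible on the list page, then pass it on.
        for link in response.css("a.item::attr(href)").getall():
            partial = {"list_url": response.url}
            yield response.follow(
                link,
                callback=self.parse_detail,
                cb_kwargs={"partial": partial},  # forwarded as a keyword argument
            )

    def parse_detail(self, response, partial):
        # The dict built in parse() arrives here; finish the record and yield once.
        partial["detail_url"] = response.url
        yield partial
```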
The spider runs, but nothing is written to the CSV output.
Each page of order results lists 30 products. I want to first collect the registration date of each product (which appears only on the first page), then follow the link to each product page to collect the detail fields, and finally store all of this information one item at a time.
I am a beginner who started writing Python about three months ago, so my basic understanding of classes and the like may be lacking. I would appreciate it if you could point that out along the way. The official Scrapy documentation is not very beginner-friendly, and I have had a hard time with it.
```python
    def parse_firstpage_item(self, response):
        request = scrapy.Request(
            url=response.url,
            callback=self.parse_productpage_item,
            cb_kwargs=dict(product_URL='//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a'))

        loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)

        loader.add_xpath("Conversion_date", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()')

        yield loader.load_item()

    def parse_productpage_item(self, response, product_URL):

        loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)

        loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')

        loader.add_value("page_URL", response.url)
        loader.add_xpath("inquire", '//*[@id="tabmenu_inqcnt"]/text()')

        yield loader.load_item()
```

```python
class MyLinkExtractor(LinkExtractor):
    def extract_links(self, response):
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [
                subdoc
                for x in self.restrict_xpaths
                for subdoc in response.xpath(x)
            ]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        logging.info('=' * 100)
        logging.info(all_links)
        logging.info(f'total links len: {len(all_links)}')
        logging.info('=' * 100)
        return all_links


class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata'
    allowed_domains = ['www.buyma.com']
    # start_urls = ['https://www.buyma.com/buyer/9887867/sales_1.html']

    rules = (
        Rule(MyLinkExtractor(
            restrict_xpaths='//*[@class="buyeritem_name"]/a'), callback='parse_firstpage_item', follow=False),
        Rule(LinkExtractor(restrict_xpaths='//DIV[@class="pager"]/DIV/A[contains(text(),"次")]'), follow=False)
    )

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)]
            # if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def start_requests(self):
        with open('/Users/morni/buyma_researchtool/buyma_researchtool/AllshoppersURL.csv', 'r', encoding='utf-8') as f:
            reader = csv.reader(f)
            header = next(reader)
            for row in reader:
                yield scrapy.Request(url=str(row[2])[:-5] + '/sales_1.html')
        for row in self.reader:
            for n in range(1, 300):
                url = f'{self.base_page}{row}/sales_{n}.html'
                yield scrapy.Request(
                    url=url,
                    callback=self.parse_firstpage_item,
                    errback=self.errback_httpbin,
                    dont_filter=True
                )

    def parse_firstpage_item(self, response):
        loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)
        loader.add_xpath("Conversion_date", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()')
        loader.add_xpath("product_name", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/text()')
        loader.add_value("product_URL", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href')
        item = loader.load_item()

        yield scrapy.Request(
            url=response.urljoin(item['product_URL']),
            callback=self.parse_productpage_item,
            cb_kwargs={'item': item},
        )

    def parse_productpage_item(self, response, item):
        loader = ItemLoader(item=item, response=response)
        loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
        # 〜
        yield loader.load_item()
```
You need to request each page in turn and pass the current item to the callback:
```python
def parse_first_page(self, response):
    loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)
    loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
    loader.add_value("page_URL", response.url)
    loader.add_xpath("inquire", '//*[@id="tabmenu_inqcnt"]/text()')
    item = loader.load_item()
    yield scrapy.Request(
        url=second_page_url,           # placeholder: URL of the next page for this item
        callback=self.parse_second_page,
        cb_kwargs={'item': item},      # pass the partially-filled item along
    )

def parse_second_page(self, response, item):
    loader = ItemLoader(item=item, response=response)
    loader.add_xpath("Conversion_date", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()')
    item = loader.load_item()
    yield scrapy.Request(
        url=third_page_url,            # placeholder: URL of the third page
        callback=self.parse_third_page,
        cb_kwargs={'item': item},
    )

def parse_third_page(self, response, item):
    loader = ItemLoader(item=item, response=response)
    loader.add_value('ThirdUrl', response.url)
    yield loader.load_item()
```
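Applied to your spider, the first-page callback would loop over the product rows on the sales page, build a partial item per row, and forward it through cb_kwargs. This is only a sketch: the relative row XPaths below are guessed from the absolute XPaths in your code, so verify them against the page, and it assumes the same imports and BuymaResearchtoolItem you already use.

```python
def parse_firstpage_item(self, response):
    # One <ul> per product row on the sales page (guessed from your absolute XPaths).
    for row in response.xpath('//*[@id="buyeritemtable"]/div/ul'):
        loader = ItemLoader(item=BuymaResearchtoolItem(), selector=row)
        loader.add_xpath("Conversion_date", 'li[2]/p[3]/text()')
        loader.add_xpath("product_name", 'li[2]/p[1]/a/text()')
        loader.add_xpath("product_URL", 'li[2]/p[1]/a/@href')
        item = loader.load_item()

        product_url = row.xpath('li[2]/p[1]/a/@href').get()
        if product_url:
            # Hand the partially-filled item to the product-page callback.
            yield response.follow(
                product_url,
                callback=self.parse_productpage_item,
                cb_kwargs={'item': item},
            )

def parse_productpage_item(self, response, item):
    # Keep filling the same item, then yield the finished record once.
    loader = ItemLoader(item=item, response=response)
    loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
    loader.add_value("page_URL", response.url)
    loader.add_xpath("inquire", '//*[@id="tabmenu_inqcnt"]/text()')
    yield loader.load_item()
```

Also note that `add_value("product_URL", '//.../@href')` in your version stores the XPath string itself as the field value, not the href it points to; use `add_xpath` (or extract the href first, as above) so that `response.urljoin()` or `response.follow()` receives a real URL.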