How to recursively crawl subpages with Scrapy

Question

jet*_*131 3 python beautifulsoup web-crawler scrapy

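Basically, I am trying to scrape a page that contains a set of categories: scrape each category's name, follow the sublink associated with each category to a page containing a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve the text data. In the end I want to output a JSON file formatted something like this: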

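Category 1 name
- Subcategory 1 name
  - data from this subcategory's page
- Subcategory name
  - data from this page
Category n name
- Subcategory 1 name
  - data from subcategory n's page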

etc.
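For reference, the layout above maps naturally onto nested JSON objects. A minimal sketch, assuming the scraper emits one flat record per subcategory page; the keys `category`, `subcategory`, and `data`, the helper name `group_items`, and the sample records are all hypothetical:

```python
import json
from collections import defaultdict


def group_items(items):
    """Group flat records into {category: {subcategory: data}}."""
    nested = defaultdict(dict)
    for item in items:
        nested[item["category"]][item["subcategory"]] = item["data"]
    return dict(nested)


# Hypothetical flat records, one per scraped subcategory page.
flat = [
    {"category": "Category 1", "subcategory": "Subcategory 1", "data": "text A"},
    {"category": "Category 1", "subcategory": "Subcategory 2", "data": "text B"},
    {"category": "Category n", "subcategory": "Subcategory 1", "data": "text C"},
]

# Serialize the nested structure as indented JSON.
print(json.dumps(group_items(flat), indent=2))
```

Grouping after the crawl keeps the spider itself simple: each callback only has to yield a flat record, and the nesting is done once all records are available.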

Ultimately, I want to be able to use this data with ElasticSearch.

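I have barely any experience with Scrapy, and this is what I have so far (it just scrapes the category names from the first page; I have no idea what to do from here). From my research, I believe I need to use a CrawlSpider, but I'm not sure what that entails. I have also been advised to use BeautifulSoup. Any help would be greatly appreciated.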

import scrapy


class randomSpider(scrapy.Spider):
    name = "helpme"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/categories',]

    def parse(self, response):
        # Yield the first link text found in each category section.
        for i in response.css('div.CategoryTreeSection'):
            yield {
                'categories': i.css('a::text').extract_first()
            }


Answer 1

Cas*_*per 6

I'm not familiar with ElasticSearch, but I would build the scraper like this:

import scrapy


class randomSpider(scrapy.Spider):
    name = "helpme"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/categories',]

    def parse(self, response):
        for i in response.css('div.CategoryTreeSection'):
            # This is where you select the subcategory url (replace the placeholder selector)
            subcategory = i.css('Put your selector here').extract_first()
            if not subcategory:
                continue  # no link found in this section
            # urljoin turns a relative link into an absolute URL
            req = scrapy.Request(response.urljoin(subcategory), callback=self.parse_subcategory)
            # Carry the category name along to the next callback
            req.meta['category'] = i.css('a::text').extract_first()
            yield req

    def parse_subcategory(self, response):
        yield {
            'category': response.meta.get('category'),
            # Select the name of the subcategory
            'subcategory': response.css('Put your selector here').extract_first(),
            # Select the data of the subcategory
            'subcategorydata': response.css('Put your selector here').extract(),
        }


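You collect the subcategory URL and send a request. The response to that request is handled in parse_subcategory. When sending the request, we attach the category name to its meta data.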

In the parse_subcategory function, you read the category name back from the meta data and collect the subcategory data from the page.
