I want to create an ExampleSpider that implements scrapy's CrawlSpider. My ExampleSpider should be able to handle pages containing only artist information, pages containing only album information, and some other pages that contain both album and artist information.
I was able to handle the first two scenarios, but the problem occurs in the third case. I am using a parse_artist(response) method to handle artist data and a parse_album(response) method to handle album data. My question is: how should I define my rules if a page contains both artist and album data?
Is there any other way? (a correct way)
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
# ArtistItem and AlbumItem are defined in the project's items module

class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
        # more rules .....
    ]

    def parse_artist(self, response):
        artist_item = ArtistItem()
        try:
            # do the scrape and assign to ArtistItem
            pass
        except Exception:
            # ignore for now
            pass
        return artist_item

    def parse_album(self, response):
        album_item = AlbumItem()
        try:
            # do the scrape and assign to AlbumItem
            pass
        except Exception:
            # ignore for now
            pass
        return album_item
CrawlSpider calls the _requests_to_follow() method to extract URLs and generate requests, as follows:
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)
As you can see, seen remembers the URLs that have already been extracted, so each URL is parsed by at most one callback: the first rule whose link extractor matches claims the link, and later rules never see it. Instead, you can define a single parse_item() that calls both parse_artist() and parse_album():
rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    yield self.parse_artist(response)
    yield self.parse_album(response)
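If some of the matched pages contain only artist data or only album data, parse_item() can inspect the response before delegating. Below is a minimal sketch; the XPath selectors //div[@class="artist"] and //div[@class="album"] are hypothetical placeholders for the site's real markup, and it assumes a Scrapy version where response.xpath() is available:

def parse_item(self, response):
    # Delegate to each parser only when the matching section actually
    # exists on the page; the div class names below are placeholders.
    if response.xpath('//div[@class="artist"]'):
        yield self.parse_artist(response)
    if response.xpath('//div[@class="album"]'):
        yield self.parse_album(response)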
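As an aside, SgmlLinkExtractor is deprecated in Scrapy 1.0+; the same single-rule setup can be written with LinkExtractor and the newer import paths:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

rules = [
    Rule(LinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]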