How do I follow the next page with Scrapy Rules?

Asked by Pro*_*Joe (3 votes) · Tags: python, web-crawler, scrapy, web-scraping

I have set up a Rule to get the next page from the start_url, but it does not work: it only crawls the start_urls page and the links found on that page (via parseLinks). It never moves on to the next page defined by the Rule.

Any help would be appreciated.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy import log
from urlparse import urlparse
from urlparse import urljoin
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/pesquisa/filtro/?tipo=0&local=0',
    ]

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]/@href')),
             follow=True),
    )

    def parse(self, response):
        sel = Selector(response)
        urls = sel.xpath('//div[@id="btReserve"]/../@href').extract()
        for url in urls:
            url = urljoin(response.url, url)
            self.log('URLS: %s' % url)
            yield Request(url, callback=self.parseLinks)

    def parseLinks(self, response):
        sel = Selector(response)
        titulo = sel.xpath('h1/text()').extract()
        morada = sel.xpath('//div[@class="MORADA"]/text()').extract()
        email = sel.xpath('//a[@class="sendMail"][1]/text()')[0].extract()
        url = sel.xpath('//div[@class="contentContacto sendUrl"]/a/text()').extract()
        telefone = sel.xpath('//div[@class="telefone"]/div[@class="contentContacto"]/text()').extract()
        fax = sel.xpath('//div[@class="fax"]/div[@class="contentContacto"]/text()').extract()
        descricao = sel.xpath('//div[@id="tbDescricao"]/p/text()').extract()
        gps = sel.xpath('//td[@class="sendGps"]/@style').extract()

        print titulo, email, morada

Answered by pau*_*rth (5 votes)

You should not override the parse method of a CrawlSpider, otherwise the Rules will not be followed.

See the warning at http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
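
As a minimal sketch of one way to restructure the spider: move the detail-page extraction into a second Rule and rename the callback so it does not collide with parse. The XPath for the detail links ('//div[@id="btReserve"]/..') is inferred from the selectors in the question, so adjust it to the actual page structure:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class MySpider(CrawlSpider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/pesquisa/filtro/?tipo=0&local=0',
    ]

    rules = (
        # Follow the "next page" link on every result page. Note that
        # restrict_xpaths should select the <a> element itself, not its
        # @href attribute -- the link extractor reads the href on its own.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]',)),
             follow=True),
        # Send each detail page to a callback that is NOT named parse,
        # so CrawlSpider's built-in parse method stays intact.
        # This XPath is an assumption based on the question's selectors.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="btReserve"]/..',)),
             callback='parseLinks'),
    )

    def parseLinks(self, response):
        # Same extraction logic as in the question, abbreviated here.
        sel = Selector(response)
        titulo = sel.xpath('//h1/text()').extract()
        morada = sel.xpath('//div[@class="MORADA"]/text()').extract()
        print titulo, morada

If you also need to process the responses for the start_urls themselves, override parse_start_url instead of parse; CrawlSpider provides that hook precisely so that its built-in parse stays untouched.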