Pro*_*Joe 3 python web-crawler scrapy web-scraping
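I have set up rules to fetch the next page from the start URL, but it is not working: the spider only crawls the start_urls page and the links found on that page (handled by parseLinks). It never moves on to the next page defined in the rules.
Any help?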
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy import log
from urlparse import urlparse
from urlparse import urljoin
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/pesquisa/filtro/?tipo=0&local=0'
    ]
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]/@href')), follow=True),)

    def parse(self, response):
        sel = Selector(response)
        urls = sel.xpath('//div[@id="btReserve"]/../@href').extract()
        for url in urls:
            url = urljoin(response.url, url)
            self.log('URLS: %s' % url)
            yield Request(url, callback = self.parseLinks)

    def parseLinks(self, response):
        sel = Selector(response)
        titulo = sel.xpath('h1/text()').extract()
        morada = sel.xpath('//div[@class="MORADA"]/text()').extract()
        email = sel.xpath('//a[@class="sendMail"][1]/text()')[0].extract()
        url = sel.xpath('//div[@class="contentContacto sendUrl"]/a/text()').extract()
        telefone = sel.xpath('//div[@class="telefone"]/div[@class="contentContacto"]/text()').extract()
        fax = sel.xpath('//div[@class="fax"]/div[@class="contentContacto"]/text()').extract()
        descricao = sel.xpath('//div[@id="tbDescricao"]/p/text()').extract()
        gps = sel.xpath('//td[@class="sendGps"]/@style').extract()
        print titulo, email, morada
You should not override the parse method in a CrawlSpider; if you do, the Rules will not be followed.
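See the warning at http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.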
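To make the warning concrete, here is a minimal sketch of the mechanism in plain Python. It does not use Scrapy and is not Scrapy's actual implementation: ToyCrawlSpider, is_next_page, parse_item and the page dictionary are all hypothetical names invented for this illustration. The only assumption carried over from the warning is that the base class applies its rules from inside parse.

```python
# Toy stand-in for a rule-driven crawler base class. This is NOT Scrapy code;
# every name here is hypothetical and exists only to illustrate the warning.

def is_next_page(link):
    """Hypothetical rule predicate: treat links containing 'page=' as pagination."""
    return "page=" in link


class ToyCrawlSpider:
    # Each rule is a (predicate, callback_name) pair.
    rules = ()

    def parse(self, page):
        """Base-class logic: apply every rule to the links found on the page."""
        results = []
        for matches, callback_name in self.rules:
            for link in page["links"]:
                if matches(link):
                    results.append(getattr(self, callback_name)(link))
        return results


class BrokenSpider(ToyCrawlSpider):
    rules = ((is_next_page, "parse_item"),)

    def parse(self, page):
        # Overriding parse replaces the base-class logic, so rules never run.
        return ["custom handling of " + page["url"]]

    def parse_item(self, link):
        return "followed " + link


class FixedSpider(ToyCrawlSpider):
    rules = ((is_next_page, "parse_item"),)

    def parse_item(self, link):
        # Callback has its own name; the inherited parse stays intact.
        return "followed " + link


page = {"url": "http://www.example.com/", "links": ["/?page=2", "/about"]}
broken_result = BrokenSpider().parse(page)
fixed_result = FixedSpider().parse(page)
print(broken_result)  # the pagination link is ignored
print(fixed_result)   # the pagination link is handled by the rule
```

For the spider in the question, the analogous change would presumably be to rename parse to something else (for example parse_listing, a hypothetical name) and pass that name through the callback argument of the Rule, leaving the inherited parse untouched.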