这是我编写的用于抓取“blablacar”网站的代码。
# -*- coding: utf-8 -*-
import scrapy
class BlablaSpider(scrapy.Spider):
name = 'blabla'
allowed_domains = ['blablacar.in']
start_urls = ['http://www.blablacar.in/ride-sharing/new-delhi/chandigarh']
def parse(self, response):
print(response.text)
Run Code Online (Sandbox Code Playgroud)
运行上述程序时,我收到错误消息
2018-06-11 00:07:05 [scrapy.extensions.telnet] 调试:Telnet 控制台监听 127.0.0.1:6023 2018-06-11 00:07:06 [scrapy.core.engine] 调试:爬行 (403) ) http://www.blablacar.in/robots.txt> (referer: None) 2018-06-11 00:07:06 [scrapy.core.engine] DEBUG: Crawled (403) http://www.blablacar .in/ride-sharing/new-delhi/chandigarh> (referer: None) 2018-06-11 00:07:06 [scrapy.spidermiddlewares.httperror] INFO:忽略响应 <403 http://www.blablacar.in /ride-sharing/new-delhi/chandigarh >: HTTP 状态代码未处理或不允许 2018-06-11 00:07:06 [scrapy.core.engine] INFO: Closing spider (finished)