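I'm trying to write a very simple website crawler that lists URLs along with their referrer and status code, for the HTTP status codes 200, 301, 302 and 404.
Scrapy turns out to work great for this: my script uses it to crawl the site and lists URLs with 200 and 404 status codes without any problem.
The problem is that I can't find how to have Scrapy follow redirects AND parse/output them. I can get one to work, but not both.
What I've tried so far:
setting meta={'dont_redirect':True} and setting REDIRECTS_ENABLED = False
adding 301, 302 to handle_httpstatus_list
changing the settings specified in the redirect middleware docs
reading the redirect middleware code for insight
various combinations of all of the above
other random stuff
If you want to take a look at the code, here is the public repo.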
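If you want to parse 301 and 302 responses and also follow them, have your callback handle 301 and 302 and mimic what RedirectMiddleware does.
Let's start with a simple spider to illustrate (it does not do what you want yet):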
import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )

    def parse(self, response):
        self.logger.info("got response for %r" % response.url)
Right now, the spider requests 2 pages, and the second one should redirect to http://example.com/
$ scrapy runspider test.py
2016-09-30 11:28:17 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:28:18 [scrapy] DEBUG: Redirecting (302) to <GET http://example.com/> from <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F>
2016-09-30 11:28:18 [handle] INFO: got response for 'https://httpbin.org/get'
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2016-09-30 11:28:18 [handle] INFO: got response for 'http://example.com/'
2016-09-30 11:28:18 [scrapy] INFO: Spider closed (finished)
The 302 is handled automatically by RedirectMiddleware, so it never gets passed to your callback.
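As a side note, when RedirectMiddleware follows a redirect it records the URLs the request went through under the redirect_urls key of the request meta, which the callback can read as response.meta. That gives you the URLs but not a chance to parse the 3xx responses themselves. A minimal sketch of reading it, using a plain dict in place of response.meta so it runs without Scrapy (redirect_chain and its arguments are illustrative names, not part of Scrapy):

```python
def redirect_chain(meta, final_url):
    """Return every URL visited, ending with the URL of the final response.

    `meta` stands in for Scrapy's response.meta; 'redirect_urls' is the key
    RedirectMiddleware fills in. It is absent when no redirect happened.
    """
    return list(meta.get('redirect_urls', [])) + [final_url]


# Hypothetical meta, shaped like what the callback would see for the
# redirected request in the run above.
meta = {
    'redirect_urls': [
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    ],
}
chain = redirect_chain(meta, 'http://example.com/')
```

Inside a spider callback this would be called as redirect_chain(response.meta, response.url).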
Let's configure the spider to handle 301 and 302 in the callback, using handle_httpstatus_list:
import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))
Let's run it:
$ scrapy runspider test.py
2016-09-30 11:33:32 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:33:33 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:33:33 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:33:33 [scrapy] INFO: Spider closed (finished)
Here, we're missing the redirection: the 302 response now reaches the callback, but nothing follows it.
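To see why the redirect is no longer followed, here is a simplified model of the check RedirectMiddleware makes before redirecting. This is a sketch from my reading of the middleware, not its actual code; the function name is made up, other conditions are left out, and details may differ between Scrapy versions:

```python
def middleware_leaves_response_alone(status, spider_statuses=(), meta=None):
    """Return True when the response is handed to the callback as-is.

    `spider_statuses` stands in for the spider's handle_httpstatus_list
    attribute and `meta` for request.meta.
    """
    meta = meta or {}
    return bool(
        meta.get('dont_redirect', False)
        or status in spider_statuses
        or status in meta.get('handle_httpstatus_list', [])
    )


# With handle_httpstatus_list = [301, 302] the 302 is passed through untouched.
passed_through = middleware_leaves_response_alone(302, [301, 302])
# Without it, the middleware takes over and follows the redirect.
followed = not middleware_leaves_response_alone(302)
```

So listing 301 and 302 in handle_httpstatus_list has the same effect on those responses as dont_redirect: you get to see them, and following them becomes your job.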
So do the same thing RedirectMiddleware does, but in the spider callback:
from six.moves.urllib.parse import urljoin

import scrapy
from scrapy.utils.python import to_native_str


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

        # do something with the response here...

        # handle redirection
        # this is copied/adapted from RedirectMiddleware
        if 300 <= response.status < 400:

            # HTTP header is ascii or latin1, redirected url will be percent-encoded utf-8
            location = to_native_str(response.headers['location'].decode('latin1'))

            # get the original request
            request = response.request
            # and the URL we got redirected to
            redirected_url = urljoin(request.url, location)

            if response.status in (301, 307) or request.method == 'HEAD':
                redirected = request.replace(url=redirected_url)
                yield redirected
            else:
                redirected = request.replace(url=redirected_url, method='GET', body='')
                redirected.headers.pop('Content-Type', None)
                redirected.headers.pop('Content-Length', None)
                yield redirected
And run the spider again:
$ scrapy runspider test.py
2016-09-30 11:45:20 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:45:21 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F)
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'http://example.com/'
2016-09-30 11:45:21 [scrapy] INFO: Spider closed (finished)
We got redirected to http://example.com/, and we also got the 302 response through our callback.
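If you want to check the redirect logic without running a crawl, the decision made in the callback can be pulled out into a plain function. This is only a sketch of the same rules as in the spider above, using the standard library's urljoin; next_request and its parameters are names I made up for illustration:

```python
from urllib.parse import urljoin


def next_request(status, method, request_url, location_header):
    """Return (url, method) for the follow-up request, or None if there is none.

    `location_header` is the raw Location header value as bytes, or None.
    """
    if not (300 <= status < 400) or location_header is None:
        return None
    # Header values are bytes; decoding as latin1 cannot fail.
    location = location_header.decode('latin1')
    # Location may be relative, so resolve it against the requested URL.
    url = urljoin(request_url, location)
    if status in (301, 307) or method == 'HEAD':
        # Same rule as the spider above: keep the original method.
        return url, method
    # Otherwise (302 here) the follow-up request becomes a GET.
    return url, 'GET'


followup = next_request(
    302,
    'GET',
    'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    b'http://example.com/',
)
```

Unlike the spider above, this version also returns None when the Location header is missing, instead of raising a KeyError.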