Oma*_*awi 5 python twisted scrapy mime-types
I have a running Scrapy project, but it is bandwidth-intensive because it tries to download a lot of binary files (zip, tar, mp3, etc.).
I think the best solution is to filter requests based on the mimetype (Content-Type:) HTTP header. I looked through the Scrapy code and found this setting:
DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
I changed it to: DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.webclients.ScrapyHTTPClientFactory'
and played a little with ScrapyHTTPPageGetter; here are the edits, highlighted:
class ScrapyHTTPPageGetter(HTTPClient):

    # this is my edit
    def handleEndHeaders(self):
        if 'Content-Type' in self.headers.keys():
            mimetype = str(self.headers['Content-Type'])
            # Actually I need only the html, but just in
            # case I've preserved all the text
            if mimetype.find('text/') > -1:
                # Good, this page is needed
                self.factory.gotHeaders(self.headers)
            else:
                self.factory.noPage(Exception('Incorrect Content-Type'))
I feel this is wrong. I need a more Scrapy-friendly way to cancel/drop the request as soon as I determine that it is an unwanted mimetype, rather than waiting for the whole body to download.
Edit:
I am asking specifically whether this part, self.factory.noPage(Exception('Incorrect Content-Type')), is the correct way to cancel a request.
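(A side note on that question: in Twisted, noPage() only reports the failure to the factory; it does not by itself stop the socket from receiving the rest of the body. A minimal, untested sketch of the idea, assuming the same ScrapyHTTPPageGetter subclass as above, would also close the transport:

    def handleEndHeaders(self):
        mimetype = str(self.headers.get('Content-Type', ''))
        if 'text/' in mimetype:
            self.factory.gotHeaders(self.headers)
        else:
            # Report the failure to the factory, then drop the TCP
            # connection so Twisted never reads the response body.
            self.factory.noPage(Exception('Incorrect Content-Type'))
            self.transport.loseConnection()

transport.loseConnection() is the standard Twisted protocol call for dropping a connection; whether Scrapy's downloader tolerates it here is not verified.)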
Update 1:
My current setup has crashed the Scrapy daemon, so please don't try to reuse the code above to solve the problem.
Update 2:
I have set up an Apache-based website for testing, with the following structure:
/var/www/scrapper-test/Zend -> /var/www/scrapper-test/Zend.zip (symlink)
/var/www/scrapper-test/Zend.zip
I noticed that Scrapy discards the responses with a .zip extension but scrapes the one without .zip, even though it is only a symbolic link to the same file.
I built this middleware to exclude any response type that is not in a whitelist of regular expressions:
from scrapy.exceptions import IgnoreRequest
from scrapy import log
import re


class FilterResponses(object):
    """Limit the HTTP response types that Scrapy downloads."""

    @staticmethod
    def is_valid_response(type_whitelist, content_type_header):
        for type_regex in type_whitelist:
            if re.search(type_regex, content_type_header):
                return True
        return False

    def process_response(self, request, response, spider):
        """
        Only allow HTTP response types that match the given list of
        filtering regexes.
        """
        # Each spider must define the variable response_type_whitelist as an
        # iterable of regular expressions, e.g. (r'text', ).
        type_whitelist = getattr(spider, "response_type_whitelist", None)
        content_type_header = response.headers.get('content-type', None)
        if not type_whitelist:
            return response
        elif not content_type_header:
            log.msg("no content type header: {}".format(response.url),
                    level=log.DEBUG, spider=spider)
            raise IgnoreRequest()
        elif self.is_valid_response(type_whitelist, content_type_header):
            log.msg("valid response {}".format(response.url),
                    level=log.DEBUG, spider=spider)
            return response
        else:
            msg = "Ignoring request {}, content-type was not in whitelist".format(response.url)
            log.msg(msg, level=log.DEBUG, spider=spider)
            raise IgnoreRequest()
To use it, add it to settings.py:
DOWNLOADER_MIDDLEWARES = {
    '[project_name].middlewares.FilterResponses': 999,
}
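The middleware reads the whitelist from the spider itself, so each spider opts in by defining response_type_whitelist. A minimal sketch (the spider name and URL are placeholders, not from the original project):

import scrapy

class MySpider(scrapy.Spider):
    name = "filtered"  # hypothetical spider name
    start_urls = ["http://example.com/"]
    # Only responses whose Content-Type matches one of these regexes
    # reach the callbacks; everything else raises IgnoreRequest.
    response_type_whitelist = (r'text/html',)

    def parse(self, response):
        yield {"url": response.url}

Note that because process_response runs after the download completes, this middleware saves parsing work but not bandwidth; the body has already been transferred by the time it is ignored.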
Oma*_*awi -1
The solution was to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.
What the proxy should do is:
Take HTTP requests from Scrapy and forward them to the target server, then inspect the Content-Type of the response. If it is not a text type, send a 403 Forbidden error back to Scrapy and close the request/response immediately. This saves time and traffic, and Scrapy doesn't crash. It really works!
var http = require('http');

http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };
    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Abort the upstream transfer before any body bytes are
            // consumed, then tell Scrapy the resource is forbidden.
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.write('Binary download is disabled.');
            clientRes.end();
            return;  // don't fall through to the piping below
        }
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });
    proxyReq.on('error', function(e) {
        console.log('problem with clientReq: ' + e.message);
    });
    proxyReq.end();
}).listen(8080);
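On the Scrapy side, the stock HttpProxyMiddleware picks up the http_proxy environment variable when the crawler starts, so one way to wire this up from Python (the address is an assumption matching .listen(8080) above) is:

import os

# Assumed proxy address, matching the Node.js server's .listen(8080);
# Scrapy's built-in HttpProxyMiddleware reads this environment variable
# at startup and routes requests through the filtering proxy.
os.environ['http_proxy'] = 'http://localhost:8080'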