wget - How to download recursively and only specific MIME types/extensions (i.e. text only)

The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.

What the proxy should do:

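Take each HTTP request from Scrapy and forward it to the server being crawled, then hand the server's response back to Scrapy. In other words, it intercepts all HTTP traffic.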

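For binary files (detected by whatever heuristic you implement), it sends a 403 Forbidden error to Scrapy and closes the request/response immediately. This saves time and traffic, and Scrapy won't crash.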
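The heuristic itself is left open above. As one possible sketch, the check could look at the Content-Type header first and fall back to the URL's file extension when the header is missing. The helper name `looksLikeText` and both lists are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: decide whether a response looks like text.
// The MIME and extension lists below are illustrative assumptions; adjust to taste.
var path = require('path');

var TEXT_MIME_EXTRAS = [
    'application/json',
    'application/xml',
    'application/xhtml+xml',
    'application/javascript'
];
var BINARY_EXTENSIONS = [
    '.png', '.jpg', '.jpeg', '.gif', '.pdf',
    '.zip', '.gz', '.exe', '.mp3', '.mp4'
];

function looksLikeText(contentType, requestUrl) {
    // Strip parameters such as "; charset=utf-8" and normalise case.
    var mime = (contentType || '').split(';')[0].trim().toLowerCase();
    if (mime) {
        return mime.indexOf('text/') === 0 || TEXT_MIME_EXTRAS.indexOf(mime) !== -1;
    }
    // No Content-Type header: fall back to the file extension in the URL.
    var pathname = (requestUrl || '').split(/[?#]/)[0];
    var ext = path.extname(pathname).toLowerCase();
    return BINARY_EXTENSIONS.indexOf(ext) === -1;
}
```

In the proxy, a call such as `looksLikeText(proxyRes.headers['content-type'], clientReq.url)` could stand in for a bare `startsWith('text/')` test, so that JSON or XML responses are not rejected as binary.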

Sample proxy code that actually works!

var http = require('http');

http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };


    var fullUrl = clientReq.headers['host'] + clientReq.url;

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
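        if (!contentType.startsWith('text/')) {
            // Not text: drop the upstream response and answer with 403.
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.write('Binary download is disabled.');
            clientRes.end();
            return; // do not fall through and write the headers a second time
        }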

        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

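    proxyReq.on('error', function(e) {
        console.log('problem with proxyReq: ' + e.message);
        // Answer the client so it does not wait for a response that never comes.
        if (!clientRes.headersSent) {
            clientRes.writeHead(502);
        }
        clientRes.end();
    });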

    proxyReq.end();

}).listen(8080);
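The listing above sends every upstream request to port 80 and passes the Host header through unchanged, so a target such as example.com:8000 would not be reached correctly. A minimal sketch of deriving the upstream options from the incoming request follows. `buildUpstreamOptions` is a hypothetical helper, and it assumes plain HTTP only (no HTTPS tunnelling via CONNECT) and a Node.js version with the global `URL` class.

```javascript
// Hypothetical helper: derive http.request options from a proxied request.
// Assumes plain HTTP; HTTPS via CONNECT tunnelling is not handled here.
function buildUpstreamOptions(method, requestUrl, headers) {
    // Proxy clients normally send an absolute URL; otherwise fall back to Host.
    var target = /^http:\/\//i.test(requestUrl)
        ? new URL(requestUrl)
        : new URL('http://' + headers['host'] + requestUrl);
    return {
        hostname: target.hostname,
        // URL.port is an empty string when the default port is used.
        port: target.port ? Number(target.port) : 80,
        path: target.pathname + target.search,
        method: method,
        headers: headers
    };
}
```

The result could be passed to `http.request` in place of the hand-built `options` object, for example `http.request(buildUpstreamOptions(clientReq.method, clientReq.url, clientReq.headers), ...)`.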

