How to connect to an https site with Scrapy via Polipo over TOR?

Cra*_*ton 8 python tor scrapy

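I'm not entirely sure what the problem is here.

I'm running Python 2.7.3 and Scrapy 0.16.5.

I've created a very simple Scrapy spider to test connecting to my local Polipo proxy, so that I can send requests out through TOR. The basic code of my spider is as follows: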

from scrapy.spider import BaseSpider

class TorSpider(BaseSpider):
    name = "tor"
    allowed_domains = ["check.torproject.org"]
    start_urls = [
        "https://check.torproject.org"
    ]

    def parse(self, response):
        print response.body

For my proxy middleware, I have defined:

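from scrapy.conf import settings  # project settings singleton (Scrapy 0.16-era import)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every outgoing request through the proxy named in settings
        request.meta['proxy'] = settings.get('HTTP_PROXY')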

HTTP_PROXY is defined in my settings file as HTTP_PROXY = 'http://localhost:8123'.
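For reference, a minimal sketch of how these two pieces might sit together in the settings file. The module path `arachnid.middlewares.ProxyMiddleware` and the order value are assumptions: the bot name `arachnid` appears in the log output, but the question does not show the project's actual module layout or how the middleware was registered.

```python
# settings.py (sketch; the module path and order value are assumed, not from the question)

# Local Polipo instance that forwards traffic to TOR.
HTTP_PROXY = 'http://localhost:8123'

# Register the custom middleware so it is applied to outgoing requests.
DOWNLOADER_MIDDLEWARES = {
    'arachnid.middlewares.ProxyMiddleware': 410,  # hypothetical path and order
}
```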

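Now, if I change my start URL to http://check.torproject.org, everything works without any problems.

If I run it against https://check.torproject.org, I get a 400 Bad Request error every time (I have also tried other https:// sites, and they all show the same problem):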

2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 1 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 2 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying <GET https://check.torproject.org> (failed 3 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) <GET https://check.torproject.org> (referer: None)
2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)

Just to double-check that nothing is wrong with my TOR/Polipo setup, I can run the following curl command in a terminal and it connects fine: curl --proxy localhost:8123 https://check.torproject.org/
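The same check can be sketched outside Scrapy with the standard library alone. This is an illustration, not code from the question: the helper names `build_proxy_opener` and `fetch_via_proxy` are hypothetical, and it is written for Python 3 (`urllib.request`) rather than the Python 2.7.3 mentioned above. For https URLs, `urllib.request` asks the proxy to open a CONNECT tunnel, so it exercises the same proxy path as the curl command.

```python
import urllib.request


def build_proxy_opener(proxy_url="http://localhost:8123"):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_url="http://localhost:8123", timeout=30):
    """Fetch url through the proxy and return (status code, body bytes)."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.status, response.read()
```

With Polipo and TOR running locally, calling `fetch_via_proxy("https://check.torproject.org/")` would be the rough equivalent of the curl command.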

Any suggestions as to what is going wrong here?

小智 0

Not sure whether these will help you:

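  • While your links may contain the answer, one of StackOverflow's goals is to catalog and organize actual solutions to problems, not just links that may break or require extra parsing. If you can summarize the relevant parts in your answer and use the links as references, your answer is more likely to be accepted. See [this page](http://stackoverflow.com/questions/how-to-answer) for more guidelines. (4 upvotes)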