Scrapy, Privoxy and Tor: SocketError: [Errno 61] Connection refused

5 python tor scrapy web-scraping

I am using Scrapy with Privoxy and Tor. This follows on from my earlier question, "Scrapy with Privoxy and Tor: how to renew IP". Here is the spider:

    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.selector import Selector
    from scrapy.http import Request

    class YourCrawler(CrawlSpider):
        name = "****"
        start_urls = [
            'https://****.com/listviews/titles.php',
        ]
        allowed_domains = ["****.com"]

        def parse(self, response):
            # go to the urls in the list
            s = Selector(response)
            page_list_urls = s.xpath('///*[@id="tab7"]/article/header/h2/a/@href').extract()
            for url in page_list_urls:
                yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

            # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
            next_page = response.css('ul.pagin li.presente ~ li a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield Request(next_page, callback=self.parse)

        # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
        def parse_following_urls(self, response):
            # Parsing rules go here
            for each_book in response.css('main#main'):
                yield {
                    'editor': each_book.css('header.datos1 > ul > li > h5 > a::text').extract(),
                }

In settings.py I have user-agent rotation and Privoxy:

    DOWNLOADER_MIDDLEWARES = {
        # user agent
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        '****.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
        # privoxy
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        '****.middlewares.ProxyMiddleware': 100,
    }

In middlewares.py I added:

    from stem import Signal
    from stem.control import Controller

    def _set_new_ip():
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password='tor_password')
            controller.signal(Signal.NEWNYM)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            _set_new_ip()
            request.meta['proxy'] = 'http://127.0.0.1:8118'
            spider.log('Proxy : %s' % request.meta['proxy'])

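If I remove the _set_new_ip() function from middlewares.py (along with the call to it in class ProxyMiddleware(object)), the spider works. But I want the spider to get a new IP for each request, which is why I added it. The problem is that every time I try to run the spider, it fails with SocketError: [Errno 61] Connection refused and the following traceback: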

    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
        result = g.send(result)
      File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
        response = yield method(request=request, spider=spider)
      File "/Users/nikita/scrapy/***/***/middlewares.py", line 71, in process_request
        _set_new_ip()
      File "/Users/nikita/scrapy/***/***/middlewares.py", line 65, in _set_new_ip
        with Controller.from_port(port=9051) as controller:
      File "/usr/local/lib/python2.7/site-packages/stem/control.py", line 998, in from_port
        control_port = stem.socket.ControlPort(address, port)
      File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 372, in __init__
        self.connect()
      File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 243, in connect
        self._socket = self._make_socket()
      File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 401, in _make_socket
        raise stem.SocketError(exc)
    SocketError: [Errno 61] Connection refused
    2017-07-11 15:50:28 [scrapy.core.engine] INFO: Closing spider (finished)

Maybe the problem is the port used in with Controller.from_port(port=9051) as controller:, but I'm not sure. If anyone has an idea, that would be great...

EDIT ---

OK, if I open a browser and go to http://127.0.0.1:8118/, it shows:

    503
    This is Privoxy 3.0.26 on localhost (127.0.0.1), port 8118, enabled
    Forwarding failure
    Privoxy was unable to socks5-forward your request http://127.0.0.1:8118/ through localhost: SOCKS5 request failed

    Just try again to see if this is a temporary problem, or check your forwarding settings and make sure that all forwarding servers are working correctly and listening where they are supposed to be listening.

So it may be related to the SOCKS5 configuration... Does anyone know?


bta*_*aek 2

My guess is:

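  1. Tor is not running. To find out, run ps (for example, ps -ax | grep tor) and netstat (for example, on macOS: netstat -an | grep 'your tor portnumber'; on Linux, replace -an with -tulnp) in a terminal to see whether Tor is really running.
  2. Your forwarding settings are not set up correctly. Judging by the 503 error message, the forwarding rule looks wrong (assuming Tor is running). In Privoxy's config file, make sure the line forward-socks5t / 127.0.0.1:9050 . is uncommented.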
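
The same checks can also be run from Python. The following is only a minimal sketch, not part of the original setup: the helper names port_is_open and check_services are made up for illustration, and the ports are assumed defaults (9050 for Tor's SOCKS port, 9051 for its control port, 8118 for Privoxy). Note that the traceback in the question fails on port 9051, which is the control port that stem connects to; that port only accepts connections if ControlPort is enabled in torrc, so it can be refused even when the SOCKS port is up.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except socket.error:
        # Connection refused, timed out, or host unreachable.
        return False
    conn.close()
    return True


# Assumed default ports; adjust them to match your torrc and Privoxy config.
SERVICES = {
    "Tor SOCKS port": 9050,
    "Tor control port": 9051,
    "Privoxy": 8118,
}


def check_services(host="127.0.0.1", services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return dict((name, port_is_open(host, port)) for name, port in services.items())


if __name__ == "__main__":
    for name, is_open in sorted(check_services().items()):
        status = "listening" if is_open else "refused / not listening"
        print("%-17s %s" % (name, status))
```

If the control port shows as refused while the SOCKS port is listening, the ControlPort setting in torrc is the first thing to look at. If both Tor ports are refused, Tor itself is probably not running, which would also explain Privoxy's 503 forwarding failure.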