Tags: python, tor, scrapy, web-scraping
I am using Scrapy with Privoxy and Tor. This follows my earlier question, "Scrapy with Privoxy and Tor: how to renew IP". Here is the spider:

    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.selector import Selector
    from scrapy.http import Request

    class YourCrawler(CrawlSpider):
        name = "****"
        start_urls = [
            'https://****.com/listviews/titles.php',
        ]
        allowed_domains = ["****.com"]

        def parse(self, response):
            # go to the urls in the list
            s = Selector(response)
            page_list_urls = s.xpath('///*[@id="tab7"]/article/header/h2/a/@href').extract()
            for url in page_list_urls:
                yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

            # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
            next_page = response.css('ul.pagin li.presente ~ li a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield Request(next_page, callback=self.parse)

        # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
        def parse_following_urls(self, response):
            # Parsing rules go here
            for each_book in response.css('main#main'):
                yield {
                    'editor': each_book.css('header.datos1 > ul > li > h5 > a::text').extract(),
                }

In settings.py, I have user-agent rotation and Privoxy:

    DOWNLOADER_MIDDLEWARES = {
        # user agent
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        '****.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
        # privoxy
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        '****.middlewares.ProxyMiddleware': 100,
    }

In middlewares.py I added:

    from stem import Signal
    from stem.control import Controller

    def _set_new_ip():
        # Ask Tor for a new identity through its control port
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password='tor_password')
            controller.signal(Signal.NEWNYM)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            _set_new_ip()
            # Route the request through Privoxy
            request.meta['proxy'] = 'http://127.0.0.1:8118'
            spider.log('Proxy : %s' % request.meta['proxy'])

If I take the `def _set_new_ip():` function out of middlewares.py, along with its call in `class ProxyMiddleware(object):`, the spider works. But I want the spider to use a new IP for each request, which is why I added it. The problem is that every time I try to run the spider, it fails with `SocketError: [Errno 61] Connection refused` and the following traceback:

    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
        result = g.send(result)
      File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
        response = yield method(request=request, spider=spider)
      File "/Users/nikita/scrapy/***/***/middlewares.py", line 71, in process_request
        _set_new_ip()
      File "/Users/nikita/scrapy/***/***/middlewares.py", line 65, in _set_new_ip
        with Controller.from_port(port=9051) as controller:
      File "/usr/local/lib/python2.7/site-packages/stem/control.py", line 998, in from_port
        control_port = stem.socket.ControlPort(address, port)
      File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 372, in __init__
        self.connect()
      File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 243, in connect
        self._socket = self._make_socket()
      File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 401, in _make_socket
        raise stem.SocketError(exc)
    SocketError: [Errno 61] Connection refused
    2017-07-11 15:50:28 [scrapy.core.engine] INFO: Closing spider (finished)

Maybe the problem is the port used in `with Controller.from_port(port=9051) as controller:`, but I'm not sure. If anyone has an idea, that would be great…
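
One way to test the port hypothesis is to check whether anything is listening on the relevant local ports before running the spider. The following is only a minimal sketch: the helper name `port_is_open` is hypothetical, 9051 and 8118 are the ports from the question, and 9050 is assumed to be Tor's SOCKS port (its usual default).

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed ports: adjust them to match your own Tor and Privoxy setup.
    checks = [("Tor SOCKS", 9050), ("Tor control", 9051), ("Privoxy", 8118)]
    for name, port in checks:
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print("%-12s 127.0.0.1:%d is %s" % (name, port, state))
```

If 9051 shows as closed, that would be consistent with the `Connection refused` raised inside `Controller.from_port`: either Tor is not running, or it is not listening on that control port.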
EDIT:

OK, if I open a browser and go to http://127.0.0.1:8118/, it shows:

    503
    This is Privoxy 3.0.26 on localhost (127.0.0.1), port 8118, enabled
    Forwarding failure
    Privoxy was unable to socks5-forward your request http://127.0.0.1:8118/ through localhost: SOCKS5 request failed

    Just try again to see if this is a temporary problem, or check your forwarding settings and make sure that all forwarding servers are working correctly and listening where they are supposed to be listening.

So it may be related to the SOCKS5 configuration… Does anyone know?
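
My guesses are:

1. Use `ps` (for example, `ps -ax | grep tor`) and `netstat` (for example, on Mac: `netstat -an | grep 'your tor portnumber'`; on Linux, replace `-an` with `-tulnp`) to check whether Tor is actually running.
2. In the Privoxy configuration, uncomment the line `forward-socks5t / 127.0.0.1:9050 .`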
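
The forwarding check can also be done programmatically. The sketch below is an illustration only: the function name `active_socks_forwards` and the sample config text are made up, and it assumes the usual Privoxy convention that lines starting with `#` are comments. To check a real setup, read your own Privoxy config file and pass its contents to the function.

```python
def active_socks_forwards(config_text):
    """Return the uncommented forward-socks* directives as lists of fields."""
    rules = []
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments; a commented-out directive is inactive.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0].startswith("forward-socks"):
            rules.append(fields)
    return rules


# Hypothetical config excerpt with the forwarding rule still commented out.
sample = """\
# forward-socks5t / 127.0.0.1:9050 .
listen-address 127.0.0.1:8118
"""
print(active_socks_forwards(sample))  # []

# The same excerpt after uncommenting the rule.
fixed = sample.replace("# forward-socks5t", "forward-socks5t")
print(active_socks_forwards(fixed))  # [['forward-socks5t', '/', '127.0.0.1:9050', '.']]
```

An empty result means Privoxy has no active SOCKS forwarding rule, so requests sent to 127.0.0.1:8118 would not be forwarded to Tor at all.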