Scrapy with Privoxy and Tor: how to renew the IP

17 python tor scrapy web-scraping

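I am working with Scrapy, Privoxy and Tor. Everything is installed and working properly. But Tor connects with the same IP every time, so I can easily be banned. Is it possible to tell Tor to reconnect every X seconds or every X connections?

Thanks!

EDIT about the configuration: for the user agent pool I followed http://tangww.com/2013/06/UsingRandomAgent/ (I had to add an __init__.py file, as mentioned in the comments), and for Privoxy and Tor I followed http://www.andrewwatters.com/privoxy/ (I had to create the private user and private group manually in the terminal). It worked :)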

My spider looks like this:

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "spider_name"
    start_urls = [
    'https://example.com/listviews/titles.php',
    ]
    allowed_domains = ["example.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="tab7"]/article/header/h2/a/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Then follow the pagination link after the current page (ul.pagin li.presente ~ li a) and start again
        next_page = response.css('ul.pagin li.presente ~ li a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        #Parsing rules go here
        for each_book in response.css('main#main'):
            yield {
                'editor': each_book.css('header.datos1 > ul > li > h5 > a::text').extract(),
            }

In settings.py I have the user agent rotation and Privoxy:

DOWNLOADER_MIDDLEWARES = {
        #user agent
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'spider_name.comm.rotate_useragent.RotateUserAgentMiddleware' :400,
        #privoxy
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        'spider_name.middlewares.ProxyMiddleware': 100
    }

In middlewares.py I added:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

And I think that's all...

EDIT II ---

OK, I changed my middlewares.py file as described in the blog post @Tomáš Linhart mentioned, from:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

to:

from stem import Signal
from stem.control import Controller

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

    def set_new_ip():
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password='tor_password')
            controller.signal(Signal.NEWNYM)

But now it is really slow, and it doesn't seem to change the IP... Did I do it right, or is something wrong?

Tom*_*art 9

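This blog post might help you, as it deals with the same issue.

EDIT: Based on your concrete requirement (a new IP for each request, or after N requests), put the appropriate call to set_new_ip in the middleware's process_request method. Note, however, that a call to set_new_ip does not always guarantee a new IP (the blog post links to the FAQ, which explains why).

EDIT2: The module with the ProxyMiddleware class would look like this: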

from stem import Signal
from stem.control import Controller

def _set_new_ip():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='tor_password')
        controller.signal(Signal.NEWNYM)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        _set_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
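
The question also asks about renewing every X seconds. As far as I know, Tor rate-limits NEWNYM signals, and opening a control connection on every request adds overhead, which may be part of why the per-request version feels slow. Below is a minimal sketch of a time-based throttle. `RenewalThrottle`, `maybe_renew`, `min_interval` and `clock` are hypothetical names, not part of Scrapy or stem, and the 10-second default is an assumption to tune.

```python
import time


class RenewalThrottle:
    """Call `renew` at most once every `min_interval` seconds (hypothetical helper)."""

    def __init__(self, renew, min_interval=10.0, clock=time.monotonic):
        self._renew = renew                # e.g. the _set_new_ip function above
        self._min_interval = min_interval  # assumed value, adjust as needed
        self._clock = clock                # injectable so the logic can be tested
        self._last_renewal = None

    def maybe_renew(self):
        """Renew if the interval has elapsed; return True if a renewal happened."""
        now = self._clock()
        if (self._last_renewal is not None
                and now - self._last_renewal < self._min_interval):
            return False
        self._renew()
        self._last_renewal = now
        return True
```

To use it, you could create `throttle = RenewalThrottle(_set_new_ip, min_interval=60)` at module level and call `throttle.maybe_renew()` in `process_request` instead of calling `_set_new_ip()` directly. stem's `Controller` also has `is_newnym_available()` and `get_newnym_wait()` if you would rather ask Tor itself.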


Duš*_*ďar 7

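But Tor connects with the same IP every time

That is a documented Tor feature:

An important thing to note is that a new circuit does not necessarily mean a new IP address. Paths are randomly selected based on heuristics like speed and stability. There are only so many large exits in the Tor network, so it's not uncommon to reuse an exit you have had previously.

That is why the code below can end up reusing the same IP address.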

from stem import Signal
from stem.control import Controller


with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='tor_password')
    controller.signal(Signal.NEWNYM)
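
One way around this is to check the exit IP after each signal and retry until it differs from the ones used recently. Here is a rough sketch of that idea, not how any particular library does it. `renew_until_changed`, `get_ip`, `send_newnym` and `recent_ips` are hypothetical names; `get_ip` is assumed to be something you supply (for example, a function that fetches an IP-echo page through Privoxy), and `send_newnym` would wrap the stem code above.

```python
def renew_until_changed(get_ip, send_newnym, recent_ips, max_attempts=5):
    """Signal NEWNYM until the exit IP is not in `recent_ips` (hypothetical helper).

    get_ip      -- callable returning the current exit IP as a string
    send_newnym -- callable that asks Tor for a new circuit
    recent_ips  -- list of recently used IPs, updated in place
    Returns the new IP, or None if every attempt landed on a recent IP.
    """
    for _ in range(max_attempts):
        send_newnym()
        ip = get_ip()
        if ip not in recent_ips:
            recent_ips.append(ip)
            return ip
    return None
```

In real use you would probably sleep between attempts, since Tor may ignore NEWNYM signals that arrive too close together, and cap the length of `recent_ips` so old addresses become usable again.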


https://github.com/DusanMadar/TorIpChanger helps you manage this behavior. Disclosure: I wrote TorIpChanger.

I've also put together a guide on how to use Python with Tor and Privoxy: https://gist.github.com/DusanMadar/8d11026b7ce0bce6a67f7dd87b999f6b.


Here is an example of how you can use TorIpChanger (pip install toripchanger) in your ProxyMiddleware.

from toripchanger import TorIpChanger


# A Tor IP will be reused only after 10 different IPs were used.
ip_changer = TorIpChanger(reuse_threshold=10)


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        ip_changer.get_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

Or, if you want to switch to a different IP after 10 requests, you can do something like the following.

from toripchanger import TorIpChanger


# A Tor IP will be reused only after 10 different IPs were used.
ip_changer = TorIpChanger(reuse_threshold=10)


class ProxyMiddleware(object):
    _requests_count = 0

    def process_request(self, request, spider):
        self._requests_count += 1
        if self._requests_count > 10:
            self._requests_count = 0
            ip_changer.get_new_ip()

        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])