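I'm working with Scrapy, Privoxy and Tor. Everything is installed and working, but Tor connects with the same IP address every time, so I get banned easily. Is it possible to tell Tor to reconnect every X seconds or every X connections?
Thanks!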
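EDIT about the configuration: for the user agent pool I followed http://tangww.com/2013/06/UsingRandomAgent/ (I had to add an __init__.py file, as mentioned in the comments there), and for Privoxy and Tor I followed http://www.andrewwatters.com/privoxy/ (I had to create the private user and private group manually from the terminal). It worked :)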
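To give an idea of what that setup amounts to on the Scrapy side, here is a minimal sketch. It is not the exact code from the linked pages: the class name, the user agent strings and the proxy address are placeholders, and it assumes Privoxy listens on its default port 8118 and forwards to Tor.

```python
import random

# Assumed values for illustration only: 8118 is Privoxy's default listen
# port, and the user agent strings are placeholders for a real pool.
PRIVOXY_PROXY = "http://127.0.0.1:8118"
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


class RandomAgentProxyMiddleware(object):
    """Hypothetical downloader middleware: random User-Agent, proxy via Privoxy."""

    def process_request(self, request, spider):
        # Pick a different user agent for each outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Send the request through Privoxy, which forwards it to Tor.
        request.meta["proxy"] = PRIVOXY_PROXY
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

A middleware like this is enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py; the module path depends on the project layout. Note that it only changes the user agent and routes traffic through the proxy. It does not make Tor pick a new exit IP, which is the part I'm asking about.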
My spider looks like this:
from scrapy.spiders import CrawlSpider  # scrapy.contrib.spiders in old Scrapy versions
from scrapy.selector import Selector
from scrapy.http import Request


class YourCrawler(CrawlSpider):
    name = "spider_name"
    start_urls = [
        'https://example.com/listviews/titles.php',
    ]
    allowed_domains = ["example.com"]

    def parse(self, response):
        # Go to the URLs in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="tab7"]/article/header/h2/a/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)
        # Go back, follow the next page link in div#paginat ul li.next a::attr(href) and start again
        next_page = …