Nil*_*esh 6 python twisted socks scrapy web-scraping
I am trying to use Scrapy over Tor. I have been trying to work out how to write a Scrapy DownloadHandler that makes its connections through SocksiPy.
Scrapy's HTTP11DownloadHandler is here: https://github.com/scrapy/scrapy/blob/master/scrapy/core/downloader/handlers/http11.py
Here is an example of creating a custom download handler: https://github.com/scrapinghub/scrapyjs/blob/master/scrapyjs/dhandler.py
Here is the code for creating the SocksiPyConnection class: http://blog.databigbang.com/distributed-scraping-with-multiple-tor-circuits/
import httplib  # Python 2 module; it is named http.client in Python 3
import socks    # SocksiPy

class SocksiPyConnection(httplib.HTTPConnection):
    def __init__(self, proxytype, proxyaddr, proxyport = None, rdns = True, username = None, password = None, *args, **kwargs):
        # Keep the proxy settings so connect() can apply them to the socket.
        self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
        httplib.HTTPConnection.__init__(self, *args, **kwargs)

    def connect(self):
        # Open the connection through a SOCKS-aware socket instead of a plain one.
        self.sock = socks.socksocket()
        self.sock.setproxy(*self.proxyargs)
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))
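Because of the complexity of the Twisted reactor in Scrapy's code, I cannot figure out how to plug SocksiPy into it. Any ideas?
Please do not suggest Privoxy-like alternatives or reply that "Scrapy doesn't work with SOCKS proxies". I know that, which is why I am trying to write a custom downloader that makes its requests through SocksiPy.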
I was able to get this working with https://github.com/habnabit/txsocksx.
After running pip install txsocksx, I needed to replace Scrapy's ScrapyAgent with txsocksx.http.SOCKS5Agent.
I simply took HTTP11DownloadHandler and ScrapyAgent from scrapy/core/downloader/handlers/http.py, subclassed them, and wrote this code:
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from txsocksx.http import SOCKS5Agent
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent
from scrapy.core.downloader.webclient import _parse

class TorProxyDownloadHandler(HTTP11DownloadHandler):
    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        agent = ScrapyTorAgent(contextFactory=self._contextFactory, pool=self._pool)
        return agent.download_request(request)

class ScrapyTorAgent(ScrapyAgent):
    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            omitConnectTunnel = proxyParams.find('noconnect') >= 0
            if scheme == 'https' and not omitConnectTunnel:
                # HTTPS through a proxy: keep Scrapy's CONNECT tunneling agent.
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get('Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                    contextFactory=self._contextFactory, connectTimeout=timeout,
                    bindAddress=bindaddress, pool=self._pool)
            else:
                # Otherwise reach the target through the SOCKS5 proxy endpoint.
                _, _, host, port, proxyParams = _parse(request.url)
                proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                    timeout=timeout, bindAddress=bindaddress)
                agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)
                return agent

        # No proxy requested: fall back to Scrapy's regular agent.
        return self._Agent(reactor, contextFactory=self._contextFactory,
            connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
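To make the branching in `_get_agent` easier to follow, here is a minimal standalone sketch of just the decision it makes. It uses only the Python 3 standard library, not Scrapy or Twisted. The function name `pick_agent` and its string return values are hypothetical, and the `noconnect` check is approximated by looking at the proxy URL's path and query rather than by calling Scrapy's `_parse`.

```python
from urllib.parse import urlsplit


def pick_agent(request_url, proxy_url=None):
    """Return which kind of agent the branching would select (illustrative only)."""
    if not proxy_url:
        # No proxy in request.meta: the regular agent is used.
        return "direct"
    scheme = urlsplit(request_url).scheme
    proxy_parts = urlsplit(proxy_url)
    # Assumption: 'noconnect' appears in the proxy URL's path or query.
    omit_connect_tunnel = "noconnect" in (proxy_parts.path + "?" + proxy_parts.query)
    if scheme == "https" and not omit_connect_tunnel:
        # HTTPS with a proxy goes to the CONNECT tunneling agent, not SOCKS5.
        return "tunnel"
    return "socks5"
```

Read this way, a plain `http` request with a proxy set takes the SOCKS5 branch, while an `https` request with a proxy still takes the tunneling branch unless `noconnect` is present.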
In settings.py, something like this is needed:
DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.TorProxyDownloadHandler'
}
Scrapy now sends its requests through a SOCKS proxy such as Tor.
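The handler only takes the SOCKS5 branch when a request carries a `proxy` entry in its `meta`, so each request needs one. The sketch below, in standard-library Python 3, shows what such a value might look like and which host and port would be pulled out of it. The address `127.0.0.1:9050` assumes a local Tor client on its default SOCKS port, and `tor_meta` and `proxy_endpoint` are hypothetical helper names, not part of Scrapy or txsocksx.

```python
from urllib.parse import urlsplit

# Assumption: Tor is running locally and listening on its default SOCKS port.
TOR_PROXY = "http://127.0.0.1:9050"


def tor_meta(proxy_url=TOR_PROXY):
    """Build the meta dict to attach to a request so the handler sees a proxy."""
    return {"proxy": proxy_url}


def proxy_endpoint(meta):
    """Extract the (host, port) pair that the SOCKS5 endpoint would connect to."""
    parts = urlsplit(meta["proxy"])
    return parts.hostname, parts.port
```

In a spider, the dict would be passed as the `meta` argument when the request is created, for example `Request(url, meta=tor_meta())`.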