I want to enable an HTTP proxy for some spiders and disable it for others.
Can I do something like this?
# settings.py
proxy_spiders = ['a1', 'b2']
if spider in proxy_spiders:  # how to get spider name ???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'myproject.middlewares.ProxyMiddleware': 410,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
else:
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
If the code above doesn't work, are there any other suggestions?
小智 34
A bit late, but since the 1.0.0 release Scrapy has had a feature that lets you override settings per spider, like this:
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        "HTTP_PROXY": 'http://127.0.0.1:8123',
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'myproject.middlewares.ProxyMiddleware': 410,
            # built-in middlewares live under scrapy.downloadermiddlewares since Scrapy 1.0
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        },
    }

class MySpider2(scrapy.Spider):
    name = "my_spider2"
    custom_settings = {"DOWNLOADER_MIDDLEWARES": {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    }}
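The thread never shows `myproject.middlewares.ProxyMiddleware`, and `HTTP_PROXY` is not a built-in Scrapy setting, so some project code has to read it. Below is a minimal sketch of what such a middleware might look like. It assumes the middleware takes the proxy URL from the custom `HTTP_PROXY` setting and passes it on through `request.meta['proxy']`. The class body is an illustration, not the asker's actual code.

```python
class ProxyMiddleware:
    """Hypothetical downloader middleware: routes requests through HTTP_PROXY if it is set."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings already includes the spider's custom_settings,
        # so each spider sees its own HTTP_PROXY value (or None).
        return cls(crawler.settings.get('HTTP_PROXY'))

    def process_request(self, request, spider):
        # Only touch the request when a proxy URL was configured.
        if self.proxy_url:
            request.meta['proxy'] = self.proxy_url
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

Written this way, a spider that does not define `HTTP_PROXY` goes through the middleware unchanged, so the middleware can also stay enabled for every spider.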
Ami*_*ini 12
There is a newer, simpler way to do this.
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'SOME_SETTING': 'some value',
    }
I use Scrapy 1.3.1.
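When several spiders share most of their `custom_settings`, one option is to build the dictionaries from a common base so that only the proxy entries differ. This is a sketch: the names `BASE_MIDDLEWARES`, `middlewares`, `PROXY_SETTINGS` and `NO_PROXY_SETTINGS` are made up for illustration, and the middleware paths are the ones from the question.

```python
# Middlewares every spider uses, with or without a proxy.
BASE_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}


def middlewares(use_proxy=False):
    """Return a fresh DOWNLOADER_MIDDLEWARES dict, optionally with the proxy middleware."""
    result = dict(BASE_MIDDLEWARES)  # copy, so the shared base is never mutated
    if use_proxy:
        result['myproject.middlewares.ProxyMiddleware'] = 410
    return result


# Candidate values for a spider's custom_settings attribute.
PROXY_SETTINGS = {
    'HTTP_PROXY': 'http://127.0.0.1:8123',
    'DOWNLOADER_MIDDLEWARES': middlewares(use_proxy=True),
}
NO_PROXY_SETTINGS = {
    'DOWNLOADER_MIDDLEWARES': middlewares(),
}
```

A spider that needs the proxy would then set `custom_settings = PROXY_SETTINGS` in its class body, and the others would use `NO_PROXY_SETTINGS`.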
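You can add settings.overrides in your spider.py file (this relies on the legacy scrapy.conf API found in older Scrapy releases). A working example: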
from scrapy.conf import settings
settings.overrides['DOWNLOAD_TIMEOUT'] = 300
For your case, something like this should also work:
from scrapy.conf import settings
settings.overrides['DOWNLOADER_MIDDLEWARES'] = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
You can define your own proxy middleware, like this:
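# Scrapy >= 1.0 path; older releases used scrapy.contrib.downloadermiddleware.httpproxy
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

class ConditionalProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        # Apply the built-in proxy logic only for spiders that set use_proxy = True
        if getattr(spider, 'use_proxy', None):
            return super(ConditionalProxyMiddleware, self).process_request(request, spider)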
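Then define the attribute use_proxy = True in the spiders where you want the proxy enabled. Don't forget to disable the default proxy middleware and enable your modified one.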
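The last step, swapping the default proxy middleware for the modified one, could look roughly like the settings fragment below. It assumes `ConditionalProxyMiddleware` lives in `myproject/middlewares.py` (the module path used in the question) and reuses 750, the order Scrapy assigns to the built-in `HttpProxyMiddleware` by default. Note that the built-in middleware, and therefore the subclass, takes its proxy from environment variables such as `http_proxy` or from `request.meta['proxy']`.

```python
# settings.py (sketch; the 'myproject.middlewares' module path is an assumption)
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in proxy middleware so it no longer applies to every spider.
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    # Enable the conditional subclass in the same slot of the middleware chain.
    'myproject.middlewares.ConditionalProxyMiddleware': 750,
}
```

With this in place, only spiders whose class body contains `use_proxy = True` go through the proxy logic. All other spiders skip it, because `process_request` returns `None` without touching the request.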