I want to enable HTTP proxies for some spiders and disable them for others.
Can I do something like this?
# settings.py
proxy_spiders = ['a1', 'b2']
if spider in proxy_spiders:  # how do I get the spider name here???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'myproject.middlewares.ProxyMiddleware': 410,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    }
else:
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    }
If the code above can't work, are there any other suggestions?
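Settings are evaluated before any spider exists, so the branch above cannot see a spider name. One workable pattern is to make the decision inside a downloader middleware, which does receive the spider object. The sketch below is an assumption, not Scrapy's built-in API: the setting name PROXY_SPIDERS and the ProxyMiddleware class are invented for illustration.

```python
# Sketch of a downloader middleware that applies a proxy only for
# spiders whose name appears in a (hypothetical) PROXY_SPIDERS setting.
class ProxyMiddleware:
    def __init__(self, proxy_url, proxy_spiders):
        self.proxy_url = proxy_url
        self.proxy_spiders = set(proxy_spiders)

    @classmethod
    def from_crawler(cls, crawler):
        # Read the proxy URL and the opt-in spider list from settings.
        return cls(
            crawler.settings.get('HTTP_PROXY', 'http://127.0.0.1:8123'),
            crawler.settings.getlist('PROXY_SPIDERS', []),
        )

    def process_request(self, request, spider):
        # spider.name is how Scrapy identifies each spider at runtime.
        if spider.name in self.proxy_spiders:
            request.meta['proxy'] = self.proxy_url
```

With this approach the middleware can stay enabled for every spider in DOWNLOADER_MIDDLEWARES; spiders not listed in PROXY_SPIDERS simply pass through untouched.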
I have a piece of Python code like this:

import json

single_quote = '{"key": "value"}'
double_quote = "{'key': 'value'}"
data = json.loads(single_quote)  # gives a dict: {'key': 'value'}
data = json.loads(double_quote)  # raises ValueError: Expecting property name: line 1 column 2 (char 1)
In Python there is no technical difference between single and double quotes, is there? So why does single_quote parse while double_quote fails?
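The quote equivalence applies only to the Python string literal, not to the JSON grammar inside it. The JSON specification requires string keys and values to use double quotes, so what matters is which quotes end up *inside* the string. A small demonstration (the ast.literal_eval fallback is just one option, useful when the text is actually a Python dict literal rather than JSON):

```python
import ast
import json

single_quote = '{"key": "value"}'   # valid JSON: inner quotes are double
double_quote = "{'key': 'value'}"   # invalid JSON: inner quotes are single

# Parses fine, because the JSON text itself uses double quotes.
assert json.loads(single_quote) == {'key': 'value'}

# Fails: single-quoted strings are not legal JSON.
try:
    json.loads(double_quote)
except ValueError:
    pass

# If the text is a Python dict literal (not JSON), ast.literal_eval
# can evaluate it safely without json:
assert ast.literal_eval(double_quote) == {'key': 'value'}
```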
I'd also like to know how Scrapy filters out URLs that have already been crawled. Does it store every crawled URL in something like a crawled_urls_list, and when it gets a new URL, scan the list to check whether it is already there?
Where is the code for this filtering part of CrawlSpider (/path/to/scrapy/contrib/spiders/crawl.py)?
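To my knowledge the deduplication does not live in CrawlSpider at all but in Scrapy's dupefilter (the RFPDupeFilter; in modern versions under scrapy/dupefilters.py), and it keeps a *set* of fixed-size request fingerprints rather than a list of raw URLs, so membership checks are O(1). The class below is a simplified sketch of that idea, not Scrapy's actual implementation:

```python
import hashlib

class SimpleDupeFilter:
    """Toy version of a fingerprint-based duplicate filter."""

    def __init__(self):
        self.fingerprints = set()  # set lookup is O(1), unlike a list scan

    def fingerprint(self, url):
        # Store a fixed-size hash instead of the full URL.
        return hashlib.sha1(url.encode('utf-8')).hexdigest()

    def request_seen(self, url):
        fp = self.fingerprint(url)
        if fp in self.fingerprints:
            return True          # already crawled: drop the request
        self.fingerprints.add(fp)
        return False             # first time: let it through
```

Scrapy's real fingerprint also takes the method, body, and canonicalized URL into account, so two requests that differ only in query-parameter order can be treated as duplicates.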
Thanks a lot!