I am trying to run a Scrapy broad crawl. The goal is to run many concurrent crawls across different domains while crawling each individual domain gently, so that the overall crawl rate stays high but the request frequency against any single URL stays low.
This is the spider I am using:
import re
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from myproject.items import MyprojectItem


class testSpider(CrawlSpider):
    name = "testCrawler16"
    start_urls = [
        "http://example.com",
    ]

    # allow/deny take regex patterns; follow only .se links
    extractor = SgmlLinkExtractor(deny=('.com', '.nl', '.org'),
                                  allow=('.se',))
    rules = (
        Rule(extractor, callback='parse_links', follow=True),
    )

    def parse_links(self, response):
        item = MyprojectItem()
        item['url'] = response.url
        # depth is set by the default-enabled DepthMiddleware
        item['depth'] = response.meta['depth']
        yield item
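Note that the `scrapy.contrib` package and `SgmlLinkExtractor` were deprecated in Scrapy 1.0; on a recent Scrapy the same spider would import from `scrapy.spiders` and `scrapy.linkextractors`. A minimal sketch of the equivalent modern spider, carrying over the item class and field names from the question (the escaped dots are a small correction, since allow/deny are regexes and an unescaped `.com` would also match e.g. `xcom`):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # LxmlLinkExtractor-based
from myproject.items import MyprojectItem


class TestSpider(CrawlSpider):
    name = "testCrawler16"
    start_urls = ["http://example.com"]

    rules = (
        # Same allow/deny intent as above, with the non-deprecated extractor
        Rule(LinkExtractor(deny=(r'\.com', r'\.nl', r'\.org'),
                           allow=(r'\.se',)),
             callback='parse_links', follow=True),
    )

    def parse_links(self, response):
        item = MyprojectItem()
        item['url'] = response.url
        item['depth'] = response.meta['depth']
        yield item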
These are the settings I am using:
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
REACTOR_THREADPOOL_MAXSIZE = 20
RETRY_ENABLED = False
REDIRECT_ENABLED = False
DOWNLOAD_TIMEOUT = 15
LOG_LEVEL = 'INFO'
…
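For the stated goal (many domains in parallel, gentle on each one), the settings that matter most are the concurrency and delay knobs. A minimal sketch of a broad-crawl-friendly configuration, assuming a recent Scrapy; the specific numbers are illustrative, not taken from the question, and should be tuned to your hardware and politeness target:

# settings.py -- illustrative values for a polite broad crawl
CONCURRENT_REQUESTS = 256            # high global concurrency across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # at most one in-flight request per domain
DOWNLOAD_DELAY = 5                   # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True          # back off automatically when a site slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

Because DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN apply per download slot (per domain by default), the crawl can stay fast overall while each individual site sees only infrequent requests.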