Post by cod*_*eer

Scrapy broad crawl with high concurrency but a low request rate per domain

I'm trying to run a Scrapy broad crawl. The goal is to have many crawls running concurrently across different domains while crawling each individual domain gently, so that the overall crawl speed stays high but the request frequency against each site stays low.

Here is the spider I'm using:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from myproject.items import MyprojectItem


class TestSpider(CrawlSpider):
    name = "testCrawler16"
    start_urls = [
        "http://example.com",
    ]

    # allow/deny are regular expressions matched against the full URL
    # (see the note on escaping after the code).
    extractor = SgmlLinkExtractor(deny=('.com', '.nl', '.org'),
                                  allow=('.se',))

    rules = (
        Rule(extractor, callback='parse_links', follow=True),
    )

    def parse_links(self, response):
        item = MyprojectItem()
        item['url'] = response.url
        # set by the default-enabled DepthMiddleware
        item['depth'] = response.meta['depth']
        yield item
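A note on the extractor: allow and deny take regular expressions matched against the whole URL, so an unescaped pattern like '.se' matches any character followed by "se" anywhere in the URL (it would also match http://example.com/search, for instance). Below is a minimal sketch of stricter patterns, assuming the intent is to stay inside the .se TLD and skip .com/.nl/.org hosts; the regexes are my illustration, not from the original post:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Hypothetical TLD patterns: anchor on the host part of the URL
# so ".se" only matches the domain suffix, not arbitrary path text.
ALLOW_SE = (r'^https?://[^/]+\.se(/|$)',)
DENY_TLDS = (r'^https?://[^/]+\.com(/|$)',
             r'^https?://[^/]+\.nl(/|$)',
             r'^https?://[^/]+\.org(/|$)')

extractor = SgmlLinkExtractor(allow=ALLOW_SE, deny=DENY_TLDS)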

Here are the settings I'm using:

BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

REACTOR_THREADPOOL_MAXSIZE = 20  # larger thread pool, mainly for DNS lookups
RETRY_ENABLED = False            # skip retries to keep the broad crawl moving
REDIRECT_ENABLED = False
DOWNLOAD_TIMEOUT = 15            # give up quickly on slow hosts
LOG_LEVEL = 'INFO' …
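The settings shown above don't include any of the concurrency knobs, so here is a minimal sketch of the standard Scrapy settings that produce the behaviour asked for: high overall concurrency, low request rate per domain. The specific numbers are placeholder assumptions for illustration, not values from the original post:

# High overall throughput: many requests in flight across all domains.
CONCURRENT_REQUESTS = 256

# Gentle per-domain behaviour: at most one in-flight request per domain,
# with a delay between consecutive requests to the same domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 5               # seconds, applied per download slot (per domain)
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter each delay between 0.5x and 1.5x

# Optionally, let AutoThrottle adapt the per-domain delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are enforced per download slot (by default one slot per domain), so each site is throttled individually while CONCURRENT_REQUESTS keeps the crawler as a whole busy.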

concurrency scrapy web-scraping python-2.7

5 votes · 1 answer · 2362 views
