How to stop a Scrapy spider after a certain number of requests?

Sai*_*ran 4 python loops scrapy python-2.7 python-3.x

I am working on a simple scraper that fetches 9gag posts and their images, but due to some technical difficulties I can't stop the scraper, and it keeps scraping, which I don't want. I want to increment a counter and stop after 100 posts. However, the 9gag page is designed so that each response returns only 10 posts, and after each iteration my counter resets to 10. As a result, my loop runs practically forever and never stops.


# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None
    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count +=1
            if gag_id:
                if (count != 100):
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()

                    yield ninegag_item


                else:
                    break


        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse) 
        print count

Here is the code for items.py:

from scrapy.item import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()

So I want to keep a global count and tried to do this by having the parse function take 3 arguments, but that gives the error:

TypeError: parse() takes exactly 3 arguments (2 given)

So, is there a way to pass a global count value, return it after every iteration, and stop after (say) 100 posts?
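For context, the TypeError happens because Scrapy always invokes the callback as parse(self, response), so extra positional parameters are never supplied. One common way to carry a running total between callbacks is to put it in the request's meta dict. The following is only a minimal sketch under that assumption (the post_count key and the 100-post limit are illustrative, not part of the original code):

import scrapy

class MetaCountSpider(scrapy.Spider):
    name = "meta_count"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    def parse(self, response):
        # The running total rides along in request.meta instead of being a
        # local variable that resets to 0 on every callback.
        count = response.meta.get('post_count', 0)
        last_gag_id = None

        for article in response.xpath('//article'):
            last_gag_id = article.xpath('@data-entry-id').extract_first()
            count += 1
            # ... build and yield a GagItem here, as in the spider above ...

        if last_gag_id and count < 100:
            next_url = 'http://9gag.com/?id=%s&c=10' % last_gag_id
            # Pass the updated total on to the next callback.
            yield scrapy.Request(next_url, callback=self.parse,
                                 meta={'post_count': count})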

The whole project is available here on Github. Even if I set POST_LIMIT = 100, the infinite loop still happens; here is the command I executed:

scrapy crawl first -s POST_LIMIT=10 --output=output.json

Fra*_*tin 6

First: use self.count, initialized outside of parse. Then don't block the parsing of the items; instead, generate new requests. See the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    count = 0

    def parse(self, response):

        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        if (self.count < self.COUNT_MAX):
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
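If the limit should also be controllable from the command line (like the -s POST_LIMIT=... attempt in the question), the spider can read it from the crawler settings instead of hard-coding COUNT_MAX. A minimal sketch, assuming a default of 100 when the setting is absent:

import scrapy

class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    count = 0

    def parse(self, response):
        # Overridable per run, e.g.: scrapy crawl first -s POST_LIMIT=100
        post_limit = self.settings.getint('POST_LIMIT', 100)

        # ... same extraction and self.count / self.last_gag_id bookkeeping as above ...

        if self.count < post_limit:
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)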


小智 6

There is a built-in setting, CLOSESPIDER_PAGECOUNT, which can be passed via the -s command-line argument or changed in the settings: scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100

A small caveat is that if you have caching enabled, it will count cache hits towards the page count as well.
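For completeness, the same limit can also be pinned on the spider itself through custom_settings, so it applies without any command-line flag. A minimal sketch:

import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ('http://www.9gag.com/', )

    # The CloseSpider extension stops the crawl after this many responses;
    # note the caveat above about cached responses being counted too.
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 100,
    }

    def parse(self, response):
        # ... normal parsing ...
        pass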