How can I use multiple requests in Scrapy (Python) and pass an item between them?

use*_*027 42 python scrapy

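I have an item object, and I need to pass it across several pages so that all the data ends up stored in a single item.

My item looks like this: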

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    description1 = Field()
    description2 = Field()
    description3 = Field()

The three descriptions live on three separate pages, and I want to fill them all into the same item.

Right now this works fine for parseDescription1:

from scrapy import Request

def page_parser(self, response):
    sites = response.xpath('//div[@class="row"]')
    items = []
    item = DmozItem()  # the item to be filled in across pages
    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    return request

def parseDescription1(self, response):
    item = response.meta['item']
    # The key must match a Field declared on DmozItem
    item['description1'] = "test"
    return item

But I want something like this:

def page_parser(self, response):
    sites = response.xpath('//div[@class="row"]')
    items = []
    item = DmozItem()
    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3)
    request.meta['item'] = item

    # Problem: only this last request is returned; the first two are discarded
    return request

def parseDescription1(self, response):
    item = response.meta['item']
    item['description1'] = "test"
    return item

def parseDescription2(self, response):
    item = response.meta['item']
    item['description2'] = "test2"
    return item

def parseDescription3(self, response):
    item = response.meta['item']
    item['description3'] = "test3"
    return item

war*_*iuc 30

No problem. Use the following instead:

def page_parser(self, response):
    sites = response.xpath('//div[@class="row"]')
    items = []
    item = DmozItem()

    # Yield one request per description page; all three share the same item.
    # Distinct URLs are used because Scrapy's duplicate filter drops repeated
    # requests for an identical URL by default.
    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    yield request

    request = Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2, meta={'item': item})
    yield request

    yield Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3, meta={'item': item})

def parseDescription1(self, response):
    item = response.meta['item']
    item['description1'] = "test"
    return item

def parseDescription2(self, response):
    item = response.meta['item']
    item['description2'] = "test2"
    return item

def parseDescription3(self, response):
    item = response.meta['item']
    item['description3'] = "test3"
    return item

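  • Note that this approach returns three items in total (each containing one 'descriptionX' key?). If you want to collect description(1,2,3) into a single item, you will have to use Dave McLain's approach or the one I proposed. (3 upvotes)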

小智 27

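To guarantee the ordering of the requests/callbacks, and to end up returning only one item, you need to chain your requests in a form like the following: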

def page_parser(self, response):
    sites = response.xpath('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    # Use the item class that declares the description fields;
    # a bare Item() has no fields and rejects every key
    request.meta['item'] = DmozItem()
    return [request]


def parseDescription1(self, response):
    item = response.meta['item']
    item['description1'] = "test"
    # Hand the item on to the next page instead of returning it
    return [Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2, meta={'item': item})]


def parseDescription2(self, response):
    item = response.meta['item']
    item['description2'] = "test2"
    return [Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3, meta={'item': item})]


def parseDescription3(self, response):
    item = response.meta['item']
    item['description3'] = "test3"
    # Last link in the chain: the item is complete, so return it
    return [item]

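Each callback function returns an iterable of items or requests; requests are scheduled, and items are run through your item pipeline.

If you return an item from each callback, you will end up with 4 items in various states of completeness in your pipeline. If you return the next request instead, you can guarantee the order of the requests, and you will have exactly one item at the end of execution.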
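The difference between the two approaches can be illustrated with a small simulation in plain Python. This is only a sketch and does not use Scrapy: `FakeRequest`, `run`, `make_filler`, and `make_chained` are hypothetical stand-ins invented for this example, the "response" handed to a callback is simply the request itself, and the toy engine processes requests strictly in FIFO order, which real Scrapy does not promise.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: URL, callback, meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta if meta is not None else {}


def run(start_requests):
    """Toy engine: queue returned requests, collect returned items."""
    queue = deque(start_requests)
    items = []
    while queue:
        request = queue.popleft()
        # The request doubles as the "response" (it carries .meta)
        for result in request.callback(request):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                # Snapshot of the item as a pipeline would see it right now
                items.append(dict(result))
    return items


def make_filler(i):
    """Fan-out style: every callback returns the shared item."""
    def fill(response):
        item = response.meta["item"]
        item[f"description{i}"] = f"test{i}"
        return [item]
    return fill


def make_chained(i, last=3):
    """Chained style: return the next request until the last page."""
    def fill(response):
        item = response.meta["item"]
        item[f"description{i}"] = f"test{i}"
        if i < last:
            return [FakeRequest(f"page{i + 1}", make_chained(i + 1, last), {"item": item})]
        return [item]
    return fill


shared = {}
fan_out_items = run(
    [FakeRequest(f"page{i}", make_filler(i), {"item": shared}) for i in (1, 2, 3)]
)
chained_items = run([FakeRequest("page1", make_chained(1), {"item": {}})])

print(len(fan_out_items), len(chained_items))  # prints "3 1"
```

In this toy run the fan-out version emits three partially filled snapshots, while the chained version emits a single item with all three descriptions set.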

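  • If you want to return a single item, this is the way to go. It does expose a problem, though: depending on your use case, some of the parseDescription(1,2,3) methods may fail, and if they do, the item is lost. See my answer to this question for a suggestion. (4 upvotes)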

oli*_*her 19

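The accepted answer returns three items in total [with description(i) set for i = 1, 2, 3].

If you want to return a single item, Dave McLain's answer does the job, but it requires parseDescription1, parseDescription2, and parseDescription3 to all succeed and run without errors before the item is returned.

In my use case, some of the sub-requests may randomly return HTTP 403/404 errors, so I lost some items even though I could have scraped them partially.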


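Workaround

I therefore currently use the following workaround: instead of passing only the item around in the request.meta dict, pass around a call stack that knows which request to call next. The dispatcher calls the next entry on the stack (as long as the stack is not empty) and returns the item once the stack is empty.

The errback request parameter is used to return to the dispatcher method on errors, which then simply continues with the next entry on the stack.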

def callnext(self, response):
    """Request the next target on the call stack, or yield the item if done."""

    # Get the meta dict through the request. This also works when the
    # method runs as an errback: the Failure it receives carries a
    # `request` attribute as well.
    meta = response.request.meta

    # Entries remaining on the stack? Request the next one
    if meta['callstack']:
        target = meta['callstack'].pop(0)
        yield Request(target['url'], meta=meta, callback=target['callback'], errback=self.callnext)
    else:
        yield meta['loader'].load_item()

def parseDescription1(self, response):

    # Recover item(loader)
    l = response.meta['loader']

    # Use just as before
    l.add_css(...)

    # Build the call stack and store it in the request meta,
    # which is where callnext() looks it up
    callstack = [
        {'url': "http://www.example.com/lin2.cpp",
         'callback': self.parseDescription2},
        {'url': "http://www.example.com/lin3.cpp",
         'callback': self.parseDescription3},
    ]
    response.meta['callstack'] = callstack

    return self.callnext(response)

def parseDescription2(self, response):

    # Recover item(loader)
    l = response.meta['loader']

    # Use just as before
    l.add_css(...)

    return self.callnext(response)


def parseDescription3(self, response):

    # ...

    return self.callnext(response)
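To see how the dispatcher above behaves when one sub-request fails, here is a small simulation in plain Python. It is only a sketch and does not use Scrapy: `FakeRequest`, `FakeResponse`, `crawl`, `call_next`, and `parse_description` are hypothetical names made up for this example, a plain dict stands in for the item loader, and a failure is simulated by listing the URL in `failing_urls` rather than by a real HTTP 403/404.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request."""

    def __init__(self, url, callback, errback, meta):
        self.url = url
        self.callback = callback
        self.errback = errback
        self.meta = meta


class FakeResponse:
    """Hypothetical stand-in for a response (or failure) tied to a request."""

    def __init__(self, request):
        self.request = request
        self.meta = request.meta


def call_next(response):
    """Return a request for the next stack entry, or the item when done."""
    meta = response.request.meta
    if meta["callstack"]:
        target = meta["callstack"].pop(0)
        return FakeRequest(target["url"], target["callback"], call_next, meta)
    return meta["item"]


def parse_description(key):
    """Build a callback that fills one field and then dispatches onward."""
    def callback(response):
        response.meta["item"][key] = f"text from {response.request.url}"
        return call_next(response)
    return callback


def crawl(first_request, failing_urls):
    """Toy engine: follow requests until a handler returns the item."""
    result = first_request
    while isinstance(result, FakeRequest):
        response = FakeResponse(result)
        # A failing URL triggers the errback instead of the callback
        failed = result.url in failing_urls
        handler = result.errback if failed else result.callback
        result = handler(response)
    return result


meta = {
    "item": {},
    "callstack": [
        {"url": "http://www.example.com/lin2.cpp",
         "callback": parse_description("description2")},
        {"url": "http://www.example.com/lin3.cpp",
         "callback": parse_description("description3")},
    ],
}
first = FakeRequest("http://www.example.com/lin1.cpp",
                    parse_description("description1"), call_next, meta)
item = crawl(first, failing_urls={"http://www.example.com/lin2.cpp"})

print(sorted(item))  # prints "['description1', 'description3']"
```

Even though the second page fails, the errback moves on to the third page, so the item comes back partially filled instead of being lost.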

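Caveat

This solution is still synchronous, and it will still fail if any exception is raised inside a callback.

For more information, see the blog post I wrote about this solution.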