Create a Scrapy array of items with multiple parse callbacks

Question

Mat*_*mik 2 python arrays scrapy scrapy-spider

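I am scraping listings with Scrapy. My script first parses the listing URLs with parse_node, then parses each listing with parse_listing, and for each listing it parses the listing's agents with parse_agent. I would like to create an array that builds up as Scrapy parses through a listing and that listing's agents, and that resets for each new listing.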

Here is my parsing script:

def parse_node(self, response, node):
    yield Request('LISTING LINK', callback=self.parse_listing)

def parse_listing(self, response):
    yield response.xpath('//node[@id="ListingId"]/text()').extract_first()
    yield response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
    for agent in string.split(response.xpath('//node[@id="Agents"]/text()').extract_first() or "", '^'):
        yield Request('AGENT LINK', callback=self.parse_agent)

def parse_agent(self, response):
    yield response.xpath('//node[@id="AgentName"]/text()').extract_first()
    yield response.xpath('//node[@id="AgentEmail"]/text()').extract_first()

I would like parse_listing to result in:

{
 'id':123,
 'title':'Amazing Listing'
}

and then parse_agent to add to the listing array:

{
 'id':123,
 'title':'Amazing Listing',
 'agent':[
  {
   'name':'jon doe',
   'email':'jon.doe@email.com'
  },
  {
   'name':'jane doe',
   'email':'jane.doe@email.com'
  }
 ]
}

How do I get the results from each level and build up a single array?

Answer 1

Gra*_*rus 5

This is a somewhat complicated issue:
you need to form a single item from multiple different URLs.

Scrapy lets you carry data over in a request's meta attribute, so you can do something like this:

from collections import defaultdict

from scrapy import Request


def parse_node(self, response, node):
    yield Request('LISTING LINK', callback=self.parse_listing)

def parse_listing(self, response):
    item = defaultdict(list)
    item['id'] = response.xpath('//node[@id="ListingId"]/text()').extract_first()
    item['title'] = response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
    # str.split replaces the Python 2-only string.split; empty entries are dropped
    agents = response.xpath('//node[@id="Agents"]/text()').extract_first() or ""
    agent_urls = [url for url in agents.split('^') if url]
    if not agent_urls:  # the listing has no agents, so the item is already complete
        yield item
        return
    # find all agent urls and start with the first one;
    # we go through agent urls one-by-one and update a single item with agent data
    url = agent_urls.pop(0)
    yield Request(url, callback=self.parse_agent,
                  meta={'item': item, 'agent_urls': agent_urls})

def parse_agent(self, response):
    item = response.meta['item']  # retrieve item generated in previous request
    agent = dict()
    agent['name'] = response.xpath('//node[@id="AgentName"]/text()').extract_first()
    agent['email'] = response.xpath('//node[@id="AgentEmail"]/text()').extract_first()
    item['agents'].append(agent)
    # check if we have any more agent urls left
    agent_urls = response.meta['agent_urls']
    if not agent_urls:  # we crawled all of the agents!
        # this callback is a generator, so the finished item must be yielded;
        # a value passed to return would never reach Scrapy
        yield item
        return
    # if we do - crawl next agent and carry over our current item
    url = agent_urls.pop(0)
    yield Request(url, callback=self.parse_agent,
                  meta={'item': item, 'agent_urls': agent_urls})
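
One Python detail matters for callbacks like parse_agent. Any function that contains yield is a generator, and a value passed to return inside a generator is attached to StopIteration instead of being produced as output. A consumer that simply iterates over the callback's results never sees it. The two functions below are hypothetical and use no Scrapy at all; they only illustrate the difference.

```python
def returns_item():
    item = {"id": 123}
    if item:
        return item  # ends the generator; the value is not emitted
    yield  # the presence of yield makes this a generator function


def yields_item():
    item = {"id": 123}
    yield item  # emitted to whoever iterates over the generator


returned = list(returns_item())  # iteration sees nothing
yielded = list(yields_item())    # iteration sees the item
```

For that reason a finished item in a generator callback is normally emitted with yield, followed by a bare return to stop.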

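To see how the item accumulates as it travels through the chained requests, here is a minimal sketch that runs without Scrapy or a network. FakeRequest, FakeResponse, PAGES and crawl are hypothetical stand-ins invented for this illustration, not Scrapy APIs, and the page data is made up to match the example in the question. Only the control flow (carry the item and the remaining URLs in meta, fetch one agent at a time, emit the item after the last one) mirrors the approach described above.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: url, callback and meta."""
    url: str
    callback: object
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a response: page data plus the request's meta."""
    data: dict
    meta: dict


# Made-up pages keyed by url, used instead of real HTTP responses.
PAGES = {
    "listing/123": {"id": 123, "title": "Amazing Listing",
                    "agents": "agent/jon^agent/jane"},
    "agent/jon": {"name": "jon doe", "email": "jon.doe@email.com"},
    "agent/jane": {"name": "jane doe", "email": "jane.doe@email.com"},
}


def parse_listing(response):
    item = defaultdict(list)
    item["id"] = response.data["id"]
    item["title"] = response.data["title"]
    agent_urls = [u for u in response.data["agents"].split("^") if u]
    if not agent_urls:
        yield item  # no agents to fetch, the item is already complete
        return
    url = agent_urls.pop(0)
    yield FakeRequest(url, parse_agent, {"item": item, "agent_urls": agent_urls})


def parse_agent(response):
    item = response.meta["item"]  # the same item object, carried along the chain
    item["agents"].append({"name": response.data["name"],
                           "email": response.data["email"]})
    agent_urls = response.meta["agent_urls"]
    if not agent_urls:
        yield item  # last agent processed, emit the finished item
        return
    url = agent_urls.pop(0)
    yield FakeRequest(url, parse_agent, {"item": item, "agent_urls": agent_urls})


def crawl(start_url, callback):
    """Tiny driver loop: follow yielded requests, collect yielded items."""
    queue = deque([FakeRequest(start_url, callback)])
    items = []
    while queue:
        request = queue.popleft()
        response = FakeResponse(PAGES[request.url], request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(dict(result))
    return items


items = crawl("listing/123", parse_listing)
```

Each listing starts with a fresh defaultdict, so nothing leaks from one listing into the next. Agents are fetched one after another, so a listing with many agents takes several sequential round trips before its item is emitted.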

Archived:	8 years, 3 months ago
Views:	2210
Last activity:	8 years, 3 months ago