Jos*_*aar 5 scrapy web-scraping python-3.x
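I'm writing a Scrapy script to pull the latest blog posts from Paul Krugman's New York Times blog. The project is going well, but now that I've reached the stage of actually extracting the data, I keep running into the same problem: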
ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://krugman.blogs.nytimes.com/more_posts_jsons/page/1/?homepage=1&apagenum=1>
The code I'm using is as follows:
import json

from scrapy import http
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
import scrapy
from tutorial.items import BlogPost


class krugSpider(CrawlSpider):
    name = 'krugbot'
    start_url = ['https://krugman.blogs.nytimes.com']

    def __init__(self):
        self.url = 'https://krugman.blogs.nytimes.com/more_posts_jsons/page/{0}/?homepage=1&apagenum={0}'

    def start_requests(self):
        yield http.Request(self.url.format('1'), callback = self.parse_page)

    def parse_page(self, response):
        data = json.loads(response.body)
        for block in range(len(data['posts'])):
            yield self.parse_block(data['posts'][block])
        page = data['args']['paged'] + 1
        url = self.url.format(str(page))
        yield http.Request(url, callback = self.parse_page)

    def parse_block(self, block):
        for content in block:
            article = BlogPost(author = 'Paul Krugman', source = 'Blog')
            paragraphs = Selector(text = content['html'])
            article['paragraphs']= paragraphs.xpath('article/p').extract()
            article['datetime'] = content['post_date']
            article['post_id'] = content['post_id']
            article['url'] = content['permalink']
            article['title'] = content['headline']
            yield article
For reference, the items.py file is:
from scrapy import Item, Field


class BlogPost(Item):
    author = Field()
    source = Field()
    datetime = Field()
    url = Field()
    post_id = Field()
    title = Field()
    paragraph = Field()
The program should be returning Scrapy 'Item' objects, not generators, so I'm not sure why it is returning a generator. Any suggestions?
Instead of iterating over self.parse_block(data['posts'][block]) and yielding each item as in the accepted answer, I believe you can also use yield from:
yield from self.parse_block(data['posts'][block])
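As a rough illustration of why the two forms are interchangeable, here is a small sketch in plain Python with no Scrapy involved. The functions `parse_block`, `parse_page_loop` and `parse_page_delegate` are simplified stand-ins, and the sample data is invented for the example. `yield from` needs Python 3.3 or later, which fits the python-3.x tag on the question.

```python
def parse_block(block):
    # Stand-in for the spider method: yields one item per entry in the block.
    for content in block:
        yield {"title": content}


def parse_page_loop(posts):
    # Explicit nested loop: re-yield every item the inner generator produces.
    for block in posts:
        for article in parse_block(block):
            yield article


def parse_page_delegate(posts):
    # Same behaviour, delegating to the inner generator with `yield from`.
    for block in posts:
        yield from parse_block(block)


# Hypothetical sample data: two blocks of post titles.
posts = [["a", "b"], ["c"]]
loop_items = list(parse_page_loop(posts))
delegated_items = list(parse_page_delegate(posts))
print(loop_items == delegated_items)
```

Both versions produce a flat stream of items, so either should satisfy Scrapy's requirement that a callback yields requests, items, dicts or None.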
That's because you are yielding a generator inside parse_page. Check this line:
yield self.parse_block(data['posts'][block])
It yields the output of parse_block, and parse_block returns a generator (because it, too, yields multiple objects).
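The effect can be reproduced without Scrapy. The following is a minimal sketch in plain Python. `parse_block` and `parse_page` are simplified stand-ins for the spider methods, and the sample data is made up for illustration.

```python
import types


def parse_block(block):
    # Any function containing `yield` is a generator function: calling it
    # returns a generator object and does not run the body immediately.
    for content in block:
        yield {"title": content}


def parse_page(posts):
    for block in posts:
        # Mirrors the problematic line: the generator object itself is
        # yielded, not the items it would produce.
        yield parse_block(block)


# Hypothetical sample data: two blocks of post titles.
results = list(parse_page([["a", "b"], ["c"]]))

# Every element is a generator rather than an item.
all_generators = all(isinstance(r, types.GeneratorType) for r in results)
print(all_generators)
```

Scrapy receives those generator objects from the callback and rejects them, which matches the "got 'generator'" part of the error message.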
It should work if you change it to:
for block in range(len(data['posts'])):
    for article in self.parse_block(data['posts'][block]):
        yield article
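To see the fix in the context of a whole page, here is a hedged sketch that runs the flattened loop over a JSON payload, again without Scrapy. The payload shape (a `posts` list of blocks, each block a list of post dicts, plus `args.paged`) is an assumption inferred from the question's code, not a confirmed description of the real endpoint. The follow-up request is represented by a plain URL string as a stand-in for `http.Request`.

```python
import json

URL_TEMPLATE = (
    "https://krugman.blogs.nytimes.com/more_posts_jsons/page/{0}/"
    "?homepage=1&apagenum={0}"
)


def parse_block(block):
    # Assumption: each block is an iterable of post dicts.
    for content in block:
        yield {"post_id": content["post_id"], "title": content["headline"]}


def parse_page(body):
    data = json.loads(body)
    # Iterate over the posts directly and re-yield each item individually.
    for block in data["posts"]:
        for article in parse_block(block):
            yield article
    # Stand-in for the next-page Request: just the formatted URL.
    yield URL_TEMPLATE.format(data["args"]["paged"] + 1)


# Made-up payload for illustration only.
sample_body = json.dumps({
    "posts": [
        [{"post_id": 1, "headline": "First"}],
        [{"post_id": 2, "headline": "Second"}],
    ],
    "args": {"paged": 1},
})
output = list(parse_page(sample_body))
print(output)
```

The callback now emits a flat sequence of individual items followed by the next-page entry, with no generator objects in it.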