I have URLs of the form:
example.com/foo/bar/page_1.html
There are 53 pages in total, each with ~20 rows.
I basically want to get every row from every page, i.e. ~53*20 items.
I have working code in my parse method that parses a single page, and for each item also goes one page deeper to get more details about it:
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.utils.response import get_base_url
    from scrapy.utils.url import urljoin_rfc

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
        for rest in restaurants:
            item = DegustaItem()
            item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
            # some items don't have a category associated with them
            try:
                item['category'] = rest.select('td[3]/a/text()').extract()[0]
            except IndexError:
                item['category'] = ''
            item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]
            # get the profile url
            rel_url = rest.select('td[2]/a/@href').extract()[0]
            # join with the base url since the profile url is relative
            base_url = get_base_url(response)
            follow = urljoin_rfc(base_url, rel_url)
            request = Request(follow, callback=self.parse_profile)
            request.meta['item'] = item
            return request  # note: this returns from inside the loop, after the first row

    def parse_profile(self, response):
        item = response.meta['item']
        # item['address'] = figure out xpath
        return item
The question is: how do I crawl each of those pages?
example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html
Ach*_*him (score 40):
There are two ways to solve your problem. In general, use yield to generate new requests instead of return; that way you can emit multiple new requests from a single callback. Check out the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
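As a minimal sketch of that idea (field extraction abbreviated; same names and imports as in the question), the loop would yield one request per row instead of returning on the first one:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
        for rest in restaurants:
            item = DegustaItem()
            item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
            # ... fill in the remaining fields as before ...
            rel_url = rest.select('td[2]/a/@href').extract()[0]
            follow = urljoin_rfc(get_base_url(response), rel_url)
            request = Request(follow, callback=self.parse_profile)
            request.meta['item'] = item
            yield request  # yield, not return: the loop keeps running and every row is followed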
In your case, though, there is probably a much simpler solution: just generate the list of start URLs from the pattern, like this:
    class MySpider(BaseSpider):
        start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1, 54)]
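Scrapy schedules every URL in start_urls and calls parse on each response, so the parse method from the question works unchanged. Equivalently, a sketch under the same assumptions (the spider name is made up) that generates the requests explicitly in start_requests:

    from scrapy.http import Request

    class MySpider(BaseSpider):
        name = 'degusta'

        def start_requests(self):
            # one request per paginated listing page, pages 1..53
            for page in xrange(1, 54):
                yield Request('http://example.com/foo/bar/page_%s.html' % page,
                              callback=self.parse)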
bsl*_*ima (score 11):
You can use a CrawlSpider instead of a BaseSpider and use an SgmlLinkExtractor to extract the links of the pagination.
For instance:
    start_urls = ['http://www.example.com/page1']
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)),
             follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)),
             callback='parse_call'),
    )
The first rule tells Scrapy to follow the links matched by the XPath expression; the second tells Scrapy to call parse_call on the links matched by its XPath expression, in case you want to parse something on each of those pages.
For more information, see the docs: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
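Putting that snippet in context, a minimal sketch of the full spider (the spider name is made up; the two XPaths are the ones assumed in the answer):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class PaginatedSpider(CrawlSpider):
        name = 'paginated'
        start_urls = ['http://www.example.com/page1']

        rules = (
            # follow pagination links without parsing them
            Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)),
                 follow=True),
            # call parse_call on every item link found on a page
            Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)),
                 callback='parse_call'),
        )

        def parse_call(self, response):
            # extract whatever you need from the linked page here
            pass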
小智 (score 7):
"scrapy - 解析分页的项目"可以有两个用例.
一个).我们只想移动表并获取数据.这是相对简单的.
    import scrapy

    class TrainSpider(scrapy.Spider):
        name = 'trip'
        start_urls = ['somewebsite']

        def parse(self, response):
            '''do something with this parser'''
            next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)
Note the last four lines: parse recursively schedules the next page with itself as the callback.
B) We want not only to move from page to page, but also to extract data from one or more links within each page:
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class StationDetailSpider(CrawlSpider):
        name = 'train'
        start_urls = ['someOtherWebsite']

        rules = (
            Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
            Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
        )

        def parse_trains(self, response):
            '''do your parsing here'''
For case B, note that:
1. We subclass CrawlSpider instead of scrapy.Spider.
2. We set 'rules':
   a) The first rule just checks whether there is a 'next_page' link available and follows it.
   b) The second rule requests every link on the page matching the format /trains/12343, i.e. /trains/\d+$, and then calls parse_trains to do the parsing.
Important: don't use the regular parse method here, since we're subclassing CrawlSpider. CrawlSpider has its own parse method that drives the rules, so we must not override it. Just remember to name your callback method something other than parse.
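To make that pitfall concrete, a minimal sketch of what not to do (hypothetical spider name):

    from scrapy.spiders import CrawlSpider

    class BrokenSpider(CrawlSpider):
        # Wrong: this overrides CrawlSpider.parse, the method that drives
        # the rules, so any rules defined on the spider silently stop working.
        def parse(self, response):
            pass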