Pagination using scrapy

Question

Van*_*del 6 python request scrapy web-scraping

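I am trying to scrape this site: http://www.aido.com/eshop/cl_2-c_189-p_185/stationery/pens.html

I can get all the products on this page, but how do I issue the request behind the "View More" link at the bottom of the page?

My code so far: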

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths='//li[@class="normalLeft"]/div/a',unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="topParentChilds"]/div/div[@class="clm2"]/a',unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths='//p[@class="proHead"]/a',unique=True)),
    Rule(SgmlLinkExtractor(allow=('http://[^/]+/[^/]+/[^/]+/[^/]+$', ), deny=('/about-us/about-us/contact-us', './music.html',  ) ,unique=True),callback='parse_item'),
)

Any help?

Answer 1

ale*_*cxe 10

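First of all, you should take a look at this thread on how to deal with scraping AJAX dynamically loaded content: Can scrapy be used to scrape dynamic content from websites that are using AJAX?

So, clicking the "View More" button fires an XHR request: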

http://www.aido.com/eshop/faces/tiles/category.jsp?q=&categoryID=189&catalogueID=2&parentCategoryID=185&viewType=grid&bnm=&atmSize=&format=&gender=&ageRange=&actor=&director=&author=&region=&compProductType=&compOperatingSystem=&compScreenSize=&compCpuSpeed=&compRam=&compGraphicProcessor=&compDedicatedGraphicMemory=&mobProductType=&mobOperatingSystem=&mobCameraMegapixels=&mobScreenSize=&mobProcessor=&mobRam=&mobInternalStorage=&elecProductType=&elecFeature=&elecPlaybackFormat=&elecOutput=&elecPlatform=&elecMegaPixels=&elecOpticalZoom=&elecCapacity=&elecDisplaySize=&narrowage=&color=&prc=&k1=&k2=&k3=&k4=&k5=&k6=&k7=&k8=&k9=&k10=&k11=&k12=&startPrize=&endPrize=&newArrival=&entityType=&entityId=&brandId=&brandCmsFlag=&boutiqueID=&nmt=&disc=&rat=&cts=empty&isBoutiqueSoldOut=undefined&sort=12&isAjax=true&hstart=24&targetDIV=searchResultDisplay

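It returns the next 24 items as text/html. Note the hstart=24 parameter: the first time you click "View More" it equals 24, the second time 48, and so on. This should be your lifesaver.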
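To make the hstart arithmetic concrete, here is a minimal sketch that builds the URL for the n-th "View More" click. It keeps only the non-empty parameters from the request above, on the assumption (not verified against the site) that the server tolerates the empty ones being omitted. The helper name `view_more_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from the XHR request shown above.
BASE_URL = "http://www.aido.com/eshop/faces/tiles/category.jsp"
PAGE_SIZE = 24  # each "View More" click returns the next 24 items


def view_more_url(click_number, page_size=PAGE_SIZE):
    """Build the URL for the n-th "View More" click (1-based).

    Hypothetical helper: only the non-empty query parameters are kept,
    assuming the server does not need the empty ones.
    """
    params = {
        "categoryID": 189,
        "catalogueID": 2,
        "parentCategoryID": 185,
        "viewType": "grid",
        "cts": "empty",
        "sort": 12,
        "isAjax": "true",
        "hstart": click_number * page_size,  # 24, 48, 72, ...
        "targetDIV": "searchResultDisplay",
    }
    return BASE_URL + "?" + urlencode(params)


# Print the URLs for the first three clicks.
for click in (1, 2, 3):
    print(view_more_url(click))
```

If the site rejects the shortened query string, pass the full parameter list from the captured request instead and change only hstart.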

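Now, you should simulate these requests in your spider. The recommended way is to instantiate scrapy's Request object, providing a callback in which you extract the data.

Hope that helps.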
