Xod*_*777 6 javascript python asp.net scrapy web-scraping
I'm using Scrapy to crawl some directories that are built with ASP.NET.
The links to the pages I need to crawl are encoded like this:
javascript:__doPostBack('ctl00$MainContent$List','Page$X')
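where X is an int between 1 and 180. The MainContent argument is always the same. I have no idea how to crawl into these. I would love to add something to the SLE (SgmlLinkExtractor) rules as simple as allow=('Page$') or attrs='__doPostBack', but my guess is that I have to be trickier in order to pull the info out of the javascript "link".
If it's easier to "unmask" each of the absolute links from the javascript code and save those to a csv, then use that csv to load requests into a new scraper, that's okay too.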
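For context, the two arguments of a `__doPostBack(...)` link are what ASP.NET copies into the hidden `__EVENTTARGET` and `__EVENTARGUMENT` form fields before submitting the form, so there is no absolute URL to "unmask". A minimal sketch of pulling the two values out of such an href is shown below. The names `POSTBACK_RE`, `parse_postback` and `postback_fields` are hypothetical helpers, and the pattern assumes single-quoted arguments exactly as in the link above.

```python
import re

# Matches __doPostBack('<target>','<argument>') as shown in the link above.
# Assumption: both arguments are single-quoted and contain no quotes themselves.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)',\s*'([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack href, or None."""
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    return match.group(1), match.group(2)


def postback_fields(href):
    """Map a __doPostBack href onto the two hidden form fields it would set."""
    parsed = parse_postback(href)
    if parsed is None:
        return {}
    target, argument = parsed
    return {'__EVENTTARGET': target, '__EVENTARGUMENT': argument}
```

The resulting dictionary is only part of the POST data: the page's other hidden inputs (such as `__VIEWSTATE`) still have to be sent along with it.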
ale*_*cxe 16
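This kind of pagination is not as trivial as it may seem. It was an interesting challenge to solve. There are several important notes about the solution provided below:
- the pagination is followed page by page, with the current page number passed around in the Request.meta dictionary
- it uses a regular BaseSpider, since there is some logic involved in the pagination
- it is important to provide headers pretending to be a real browser
- it is important to yield FormRequests with dont_filter=True, since every request is a POST to the same URL, just with different parameters
The code: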
import re

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


HEADERS = {
    'X-MicrosoftAjax': 'Delta=true',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
}
URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'


class ExitRealtySpider(BaseSpider):
    name = "exit_realty"

    allowed_domains = ["exitrealty.com"]
    start_urls = [URL]

    def parse(self, response):
        # submit a form (first page)
        self.data = {}
        for form_input in response.css('form#aspnetForm input'):
            name = form_input.xpath('@name').extract()[0]
            try:
                value = form_input.xpath('@value').extract()[0]
            except IndexError:
                value = ""
            self.data[name] = value

        self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
        self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
        self.data['__EVENTARGUMENT'] = 'Page$1'

        return FormRequest(url=URL,
                           method='POST',
                           callback=self.parse_page,
                           formdata=self.data,
                           meta={'page': 1},
                           dont_filter=True,
                           headers=HEADERS)

    def parse_page(self, response):
        current_page = response.meta['page'] + 1

        # parse agents (TODO: yield items instead of printing)
        for agent in response.xpath('//a[@class="regtext"]/text()'):
            print agent.extract()
        print "------"

        # request the next page
        data = {
            '__EVENTARGUMENT': 'Page$%d' % current_page,
            '__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.body, re.MULTILINE).group(1),
            '__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.body, re.MULTILINE).group(1),
            '__ASYNCPOST': 'true',
            '__EVENTTARGET': 'ctl00$MainContent$agentList',
            'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
            '': ''
        }

        return FormRequest(url=URL,
                           method='POST',
                           formdata=data,
                           callback=self.parse_page,
                           meta={'page': current_page},
                           dont_filter=True,
                           headers=HEADERS)
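One thing the spider above does not do is decide when to stop: parse_page always requests another page, and the two re.search(...).group(1) calls would raise AttributeError if a response did not contain the expected fields. A minimal sketch of a guard is shown below. The helpers `extract_hidden_field` and `next_page_formdata` are hypothetical names, the default `last_page=180` is taken from the page range mentioned in the question and may not match the actual site, and the body is assumed to be passed in as text rather than bytes.

```python
import re


def extract_hidden_field(body, name):
    """Return the value following 'name|' in a pipe-delimited body, or None."""
    match = re.search(r"%s\|(.*?)\|" % re.escape(name), body)
    return match.group(1) if match else None


def next_page_formdata(body, next_page, last_page=180):
    """Build the POST data for next_page, or return None to stop paginating.

    Assumption: last_page defaults to 180, the upper bound from the question.
    """
    if next_page > last_page:
        return None
    viewstate = extract_hidden_field(body, '__VIEWSTATE')
    validation = extract_hidden_field(body, '__EVENTVALIDATION')
    if viewstate is None or validation is None:
        # The response did not carry the expected hidden fields; stop here.
        return None
    return {
        '__EVENTTARGET': 'ctl00$MainContent$agentList',
        '__EVENTARGUMENT': 'Page$%d' % next_page,
        '__VIEWSTATE': viewstate,
        '__EVENTVALIDATION': validation,
        '__ASYNCPOST': 'true',
        'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
    }
```

In parse_page, the result could be checked before building the FormRequest: if it is None, the callback simply returns without scheduling another request.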