Scraping with __doPostBack hidden in link URLs

use*_*030 8 javascript python scrapy dopostback

I am trying to scrape search results from a website that uses the __doPostBack function. The page displays 10 results per search query; to see more results, one has to click a button that triggers __doPostBack in javascript. After some research I realized that this POST request behaves just like a form submission, and that the form can simply be filled in with scrapy's FormRequest (a short sketch of how a __doPostBack link maps to form fields follows the script below). I used the following post:

Trouble using scrapy with the javascript __doPostBack method

to write the following script.

# -*- coding: utf-8 -*- 
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import FormRequest
from scrapy.http import Request
from scrapy.selector import Selector
from ahram.items import AhramItem
import re

class MySpider(CrawlSpider):
    name = u"el_ahram2"

    def start_requests(self):
        search_term = u'اقتصاد'  # Arabic search term; appears URL-encoded in the log below
        baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
        requests = []
        for i in range(1, 4):#crawl first 3 pages as a test
            argument =  u"'Page$"+ str(i+1) + u"'"
            data = {'__EVENTTARGET': u"'GridView1'", '__EVENTARGUMENT': argument}
            currentPage = FormRequest(baseUrl, formdata = data, callback = self.fetch_articles)
            requests.append(currentPage)
        return requests

    def fetch_articles(self, response):
        sel = Selector(response)
        for ref in sel.xpath("//a[contains(@href,'checkpart.aspx?Serial=')]/@href").extract(): 
            yield Request('http://digital.ahram.org.eg/' + ref, callback=self.parse_items)

    def parse_items(self, response):
        sel = Selector(response)
        the_title = ' '.join(sel.xpath("//title/text()").extract()).replace('\n','').replace('\r','').replace('\t','')
        the_authors = '---'.join(sel.xpath("//*[contains(@id,'editorsdatalst_HyperLink')]//text()").extract())  # '*' matches any element
        the_text = ' '.join(sel.xpath("//span[@id='TextBox2']/text()").extract())
        the_month_year = ' '.join(sel.xpath("string(//span[@id = 'Label1'])").extract())
        the_day = ' '.join(sel.xpath("string(//span[@id = 'Label2'])").extract())
        item = AhramItem()
        item["Authors"] = the_authors
        item["Title"] = the_title
        item["MonthYear"] = the_month_year
        item["Day"] = the_day
        item['Text'] = the_text
        return item
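
For reference, the pagination links on the page look like <a href="javascript:__doPostBack('GridView1','Page$2')">2</a>: clicking one fills two hidden inputs and submits the page's form. Below is a minimal sketch of that mapping (postback_formdata is only an illustrative helper, not part of the spider):

import re

def postback_formdata(href):
    # Illustrative helper: pull __EVENTTARGET / __EVENTARGUMENT out of a
    # javascript:__doPostBack('target','argument') href.
    target, argument = re.match(
        r"javascript:__doPostBack\('([^']*)','([^']*)'\)", href).groups()
    return {'__EVENTTARGET': target, '__EVENTARGUMENT': argument}

print(postback_formdata(u"javascript:__doPostBack('GridView1','Page$2')"))
# -> __EVENTTARGET: GridView1, __EVENTARGUMENT: Page$2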

My problem now is that fetch_articles is never called:

2014-05-27 12:19:12+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] INFO: Closing spider (finished)

After several days of searching I feel completely stuck. I am a beginner in python, so the mistake may be trivial. But if it is not, this thread might be useful to many people. Thanks in advance for your help.

Jon*_*den 4

Your code is fine. fetch_articles is running; you can verify that by adding a print statement.
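
For instance, a quick check (a sketch; self.log is scrapy's built-in spider logging, so this is a drop-in replacement for fetch_articles above):

    def fetch_articles(self, response):
        sel = Selector(response)
        refs = sel.xpath("//a[contains(@href,'checkpart.aspx?Serial=')]/@href").extract()
        # This line proves the callback runs; on the error page refs is empty.
        self.log("fetch_articles called, %d article links found" % len(refs))
        for ref in refs:
            yield Request('http://digital.ahram.org.eg/' + ref, callback=self.parse_items)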

However, the site requires you to validate your POST requests. To validate them, you must include __EVENTVALIDATION and __VIEWSTATE in the request body to prove that you are responding to their form. To obtain these, you first need to issue a GET request and extract the two fields from the form on that page. If you don't supply them, you get an error page instead, which contains no links with 'checkpart.aspx?Serial=', so your for loop never executes.

Here is how I would set up start_requests, with fetch_search now doing what start_requests did before.

class MySpider(CrawlSpider):
    name = u"el_ahram2"

    def start_requests(self):
        search_term = u'اقتصاد'
        baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
        SearchPage = Request(baseUrl, callback = self.fetch_search)
        return [SearchPage]

    def fetch_search(self, response):
        sel = Selector(response)
        search_term = u'اقتصاد'
        baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
        viewstate = sel.xpath("//input[@id='__VIEWSTATE']/@value").extract().pop()
        eventvalidation = sel.xpath("//input[@id='__EVENTVALIDATION']/@value").extract().pop()
        for i in range(1, 4):  # crawl first 3 pages as a test
            argument = u"'Page$" + str(i+1) + u"'"
            data = {'__EVENTTARGET': u"'GridView1'", '__EVENTARGUMENT': argument, '__VIEWSTATE': viewstate, '__EVENTVALIDATION': eventvalidation}
            currentPage = FormRequest(baseUrl, formdata = data, callback = self.fetch_articles)
            yield currentPage

    ...
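
As an aside, scrapy also provides FormRequest.from_response, which copies the hidden inputs of the fetched page's form (including __VIEWSTATE and __EVENTVALIDATION) into the POST body for you, so fetch_search could be shortened to something like this (a sketch, keeping the field values exactly as above, untested against this particular site):

    def fetch_search(self, response):
        for i in range(1, 4):  # crawl first 3 pages as a test
            # from_response pre-fills every hidden input of the page's form;
            # dont_click=True keeps the submit button's name/value out of the data.
            yield FormRequest.from_response(
                response,
                formdata={'__EVENTTARGET': u"'GridView1'",
                          '__EVENTARGUMENT': u"'Page$" + str(i + 1) + u"'"},
                dont_click=True,
                callback=self.fetch_articles)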