How to integrate scrapyjs functionality into a Scrapy project

lor*_*771 6 javascript python scrapy web-scraping python-2.7

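I built a web scraper with the Scrapy framework to collect concert ticket data from this website. I can successfully scrape the elements inside each ticket on the page, except for the price. The price is only reachable by clicking the "tickets" button to go to the tickets page and scraping it from the tickets listed there.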

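After extensive Googling, I found that Scrapy.js (which is based on Splash) can be used within Scrapy to interact with JavaScript on the page (such as a button that needs to be clicked). I have seen some basic examples of Splash interacting with JavaScript, but none of them show Splash integrated with Scrapy (not even in the docs).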

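So far I have followed the pattern of using item loaders to store the scraped elements in the parse method, then issuing a request that goes to another link and parses that page's HTML by calling a second parse method, e.g.:

yield scrapy.Request(next_link, callback=self.parse_price)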

That code will have to change somewhat now that I am going to use Scrapyjs. To integrate Scrapyjs, I was thinking of using a function similar to this:

function main(splash)
  splash:go("http://example.com")
  splash:wait(0.5)
  local title = splash:evaljs("document.title")
  return {title=title}
end

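That snippet comes from this website, but since JavaScript cannot be written directly inside a Python program, how and where would I incorporate that kind of function into the program so that I can navigate to the next page by clicking the button and then parse the HTML? I am obviously very new to web scraping, so any help would be greatly appreciated. The code for the spider is below: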

concert_ticket_spider.py

from scrapy.contrib.spiders import CrawlSpider , Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from scrapy.http import Request
from concert_comparator.items import ComparatorItem

bandname = raw_input("Enter a bandname \n")
vs_url = "http://www.vividseats.com/concerts/" + bandname + "-tickets.html"

class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
    #rules = (Rule(LinkExtractor(allow=('/' + bandname + '-.*', )), callback='parse_price'))
    # item = ComparatorItem()
    tickets_list_xpath = './/*[@itemtype="http://schema.org/Event"]'
    item_fields = {
        'eventName' : './/*[@class="productionsEvent"]/text()',
        'eventLocation' : './/*[@class = "productionsVenue"]/span[@itemprop  = "name"]/text()',
        'ticketsLink' : './/a/@href',
        'eventDate' : './/*[@class = "productionsDate"]/text()',
        'eventCity' : './/*[@class = "productionsVenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addressLocality"]/text()',
        'eventState' : './/*[@class = "productionsVenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addressRegion"]/text()',
        'eventTime' : './/*[@class = "productionsTime"]/text()'
    }


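    item_fields2 = {
        'ticketPrice': '//*[@class="eventTickets lastChild"]/div/div/@data-origin-price',
    }

    def parse_price(self, response):
        # build a loader for the tickets page before adding the price XPath
        l = XPathItemLoader(ComparatorItem(), response=response)
        l.add_xpath('ticketPrice', './/*[@class =  "price"]/text()')
        yield l.load_item()

    def parse(self, response):
        """
        Parse the event listing page and load one item per ticket.
        """
        selector = HtmlXPathSelector(response)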
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):

            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            for field, xpath in self.item_fields2.iteritems():
                loader.add_xpath(field, xpath)

            # the Lua script cannot sit bare inside Python; it is passed to
            # Splash as a string and run by the 'execute' endpoint
            yield Request(vs_url, self.parse_price, meta={
                'splash': {
                    'endpoint': 'execute',
                    'args': {
                        # 'url' is prefilled from request url
                        'lua_source': """
                        function main(splash)
                            splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
                            splash:go(splash.args.url)
                            splash:runjs("$('#some-button').click()")
                            return splash:html()
                        end
                        """,
                    },
                },
            })

            yield loader.load_item()

ale*_*cxe 1

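The key point here is that scrapyjs provides a middleware, scrapyjs.SplashMiddleware, that you need to configure. Once it is enabled, every request that carries a splash meta key is processed by the middleware.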
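As a rough sketch of that configuration, the project's settings.py might contain something like the following. The setting names follow the scrapyjs README as I recall it. The Splash address is an assumption and has to point at wherever your own Splash instance is running.

```python
# settings.py (sketch): enable the scrapyjs middleware for the project.

# Assumption: a Splash instance is reachable at this address; adjust as needed.
SPLASH_URL = 'http://localhost:8050'

# Register the middleware so requests with a 'splash' meta key go through Splash.
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

# Dupefilter that takes the Splash arguments into account when comparing requests.
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
```

Requests without a splash meta key are not routed through Splash, so the rest of the spider keeps working as before.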

FYI, I have personally used Scrapy with scrapyjs successfully before.
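On the request side, the Lua script is not written as bare code inside Python; it is passed to Splash as a string. A minimal sketch, assuming the 'execute' endpoint and a hypothetical '#some-button' selector (the helper name splash_meta and the constant LUA_CLICK_SCRIPT are illustrative, not part of scrapyjs):

```python
# Lua script run by Splash; the button selector is a placeholder.
LUA_CLICK_SCRIPT = """
function main(splash)
  splash:go(splash.args.url)
  splash:wait(0.5)
  splash:runjs("document.querySelector('#some-button').click()")
  splash:wait(0.5)
  return splash:html()
end
"""


def splash_meta(lua_source, **extra_args):
    """Build a request.meta dict carrying the 'splash' key the middleware looks for."""
    args = {'lua_source': lua_source}
    args.update(extra_args)  # any further Splash arguments, passed through unchanged
    return {'splash': {'endpoint': 'execute', 'args': args}}


# In a spider callback (needs Scrapy plus a running Splash instance):
#     yield scrapy.Request(next_link, callback=self.parse_price,
#                          meta=splash_meta(LUA_CLICK_SCRIPT))
```

The callback then receives the HTML returned by the script, so the existing XPath-based parse_price logic can stay largely as it is.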