Scrapy + Splash + ScrapyJS

Question

psy*_*ok7 5 python screen-scraping scrapy scrapy-spider

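I am using Splash 2.0.2 + Scrapy 1.0.5 + Scrapyjs 0.1.1, and I still cannot render JavaScript that is triggered by a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf

The page I get back still does not contain the phone number: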

import scrapy


class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        script = """
        function main(splash)
            splash:go(splash.args.url)
            splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
            splash:wait(0.5)
            return splash:html()
        end
        """
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_house_contents, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })

        for next_page in response.css('.pager .br3.brc8::attr(href)'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse)

    def parse_house_contents(self, response):

        import ipdb;ipdb.set_trace()

How can I get this to work?

Answer 1

ale*_*cxe 2

You can avoid using Splash in the first place and fetch the phone number yourself by making the appropriate GET request. A working spider:

import json
import re

import scrapy

class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_house_contents)

        for next_page in response.css('.pager .br3.brc8::attr(href)'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse)

    def parse_house_contents(self, response):
        property_id = re.search(r"ID(\w+)\.", response.url).group(1)

        phone_url = "https://olx.pt/ajax/misc/contact/phone/%s/" % property_id
        yield scrapy.Request(phone_url, callback=self.parse_phone)

    def parse_phone(self, response):
        phone_number = json.loads(response.body)["value"]
        print(phone_number)

If there is more to extract from this "dynamic" website, check whether Splash is really enough and, if it is not, look into browser automation and selenium.
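The two steps of the spider above that are easiest to get wrong, pulling the ad ID out of the URL and decoding the JSON reply, can be checked offline without Scrapy. The following is a minimal sketch under stated assumptions: the helper names (`extract_property_id`, `build_phone_url`, `parse_phone_payload`) are hypothetical, the sample payload is made up, and the regex, endpoint path and `"value"` key are taken from the spider above and may no longer match the live site.

```python
import json
import re

# Regex from the spider above: captures the word characters between
# "ID" and the following "." in the ad URL.
AD_ID_PATTERN = re.compile(r"ID(\w+)\.")

# Endpoint path as used in the spider above (may have changed on the live site).
PHONE_URL_TEMPLATE = "https://olx.pt/ajax/misc/contact/phone/%s/"


def extract_property_id(url):
    """Return the ad ID embedded in the URL, or None if the pattern is absent."""
    match = AD_ID_PATTERN.search(url)
    return match.group(1) if match else None


def build_phone_url(property_id):
    """Build the phone-number endpoint URL for a given ad ID."""
    return PHONE_URL_TEMPLATE % property_id


def parse_phone_payload(body):
    """Decode the JSON body (bytes or str) and return its "value" field, if any."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body).get("value")


if __name__ == "__main__":
    ad_url = (
        "https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-"
        "fechada-para-arrumos-IDyTzAT.html#c49d3d94cf"
    )
    ad_id = extract_property_id(ad_url)
    print(ad_id)
    print(build_phone_url(ad_id))
    # Hypothetical payload, only illustrating the assumed {"value": ...} shape.
    print(parse_phone_payload(b'{"value": "000 000 000"}'))
```

Returning `None` instead of calling `.group(1)` directly lets a spider skip URLs that do not carry an ID rather than raise `AttributeError` in the middle of a crawl.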
