Using InitSpider with Splash: only the login page gets parsed?

Question

gog*_*urt 12 python splash-screen scrapy

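This is a follow-up to a question I asked earlier.

I'm trying to scrape a web page that I have to log in to first. After authentication, though, the page I need requires some JavaScript to run before its content is visible. What I've done is follow the instructions here to install Splash to try to render the JavaScript. However...

Before I switched to Splash, authenticating with Scrapy's InitSpider worked fine. I was getting through the login page and scraping the target page OK (except, obviously, without the JavaScript working). But once I add the code to pass the requests through Splash, it looks like I'm not parsing the target page.

The spider is below. The only difference between the Splash version (here) and the non-Splash version is the function def start_requests(). Everything else is the same between the two.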

import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
            "http://www.bridgebase.com/myhands/index.php"
            ]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F" 

    # authentication
    def init_request(self):
        return scrapy.http.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.http.FormRequest.from_response(
            response,
            formdata={'username': 'USERNAME', 'password': 'PASSWORD'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        # response.text is the decoded body; response.body is bytes on Python 3
        if "recent tournaments" in response.text:
            self.log("Login successful")
            return self.initialized()
        else:
            self.log("Login failed")
            print(response.body)

    # pipe the requests through splash so the JS renders 
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            }) 

    # what to do when a link is encountered
    rules = (
            Rule(LinkExtractor(), callback='parse_item'),
            )

    # do nothing on new link for now
    def parse_item(self, response):
        pass

    def parse(self, response):
        filename = 'test.html' 
        with open(filename, 'wb') as f:
            f.write(response.body)

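What's happening now is that test.html, the output of parse(), is just the login page itself rather than the page I should be redirected to after logging in.

The log is telling: ordinarily I would see the "Login successful" line from check_login_response(), but as you can see below, it seems I'm not even reaching that step. Is this because Scrapy is now also sending the authentication requests through Splash, and they are getting hung up there? If so, is there a way to bypass Splash for the authentication part only?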

2016-01-24 14:54:56 [scrapy] INFO: Spider opened
2016-01-24 14:54:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-24 14:54:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-24 14:55:02 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
2016-01-24 14:55:02 [scrapy] INFO: Closing spider (finished)

I'm pretty sure I'm not using Splash correctly. Can anyone point me to some documentation where I can figure out what's going on?
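
A note on why the log shows a single Splash request and no login: as far as I can tell from Scrapy's source, InitSpider does its work inside its own start_requests(), which sets the normal start requests aside and issues init_request() first. Overriding start_requests() in the spider would therefore skip the login step altogether, rather than sending it through Splash. The sketch below is a toy model of that mechanism in plain Python. ToyInitSpider, LoginFirst, SplashOverride and the request strings are all made up for illustration and are not Scrapy classes or real requests.

```python
class ToyInitSpider:
    """Simplified stand-in for scrapy.spiders.init.InitSpider (not the real class)."""
    start_urls = ["target-page"]

    def start_requests(self):
        # Set the normal start requests aside until initialization is done...
        self._postinit_reqs = ["GET " + url for url in self.start_urls]
        # ...and run the initialization step first.
        return self.init_request()

    def init_request(self):
        # Default: nothing to initialize, release the start requests at once.
        return self.initialized()

    def initialized(self):
        return self._postinit_reqs


class LoginFirst(ToyInitSpider):
    def init_request(self):
        # The real spider returns a login Request with a callback; in this
        # toy model the login "succeeds" immediately.
        return ["GET login-page"] + self.initialized()


class SplashOverride(LoginFirst):
    def start_requests(self):
        # Replacing start_requests() means init_request() is never called.
        return ["POST splash render.html for " + url for url in self.start_urls]


print(LoginFirst().start_requests())
print(SplashOverride().start_requests())
```

In the toy model the second spider never emits the login request, which matches the single POST to render.html in the log above. If this reading is right, the Splash meta would have to be attached to the requests released after initialized(), not in an overridden start_requests().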

Answer 1

ale*_*cxe 6

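I don't think Splash alone would handle this particular case well.

Here is the working idea:

log in to the website using selenium and the PhantomJS headless browser
pass the browser cookies from PhantomJS to Scrapy

The code: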

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        driver = webdriver.PhantomJS()
        driver.get(self.login_page)

        driver.find_element_by_id("username").send_keys("user")
        driver.find_element_by_id("password").send_keys("password")

        driver.find_element_by_name("submit").click()

        driver.save_screenshot("test.png")
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))

        cookies = driver.get_cookies()
        driver.close()

        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)

    def parse(self, response):
        # response.text is the decoded body; response.body is bytes on Python 3
        if "recent tournaments" in response.text:
            self.log("Login successful")
        else:
            self.log("Login failed")
        print(response.body)

It prints "Login successful" and the HTML of the "hands" page.
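
The cookie hand-off in the spider above can also be isolated into a small helper. Selenium's get_cookies() returns dicts that carry extra browser fields (expiry, httpOnly and so on), while Scrapy's Request documents its cookies argument as a list of dicts with name and value and, optionally, domain and path. Passing the full list through appears to work, as the answer shows; trimming it is just a way to make the hand-off explicit. This is a minimal sketch: the helper name selenium_cookies_to_scrapy and the sample cookie are hypothetical, not part of either library or of the site's real session data.

```python
def selenium_cookies_to_scrapy(cookies):
    """Keep only the cookie fields that Scrapy's Request documents."""
    wanted = ("name", "value", "domain", "path")
    return [{key: cookie[key] for key in wanted if key in cookie}
            for cookie in cookies]


# Hypothetical cookie in the shape returned by driver.get_cookies()
sample = [{
    "name": "PHPSESSID",
    "value": "abc123",
    "domain": "www.bridgebase.com",
    "path": "/",
    "httpOnly": True,
    "secure": False,
    "expiry": 1700000000,
}]

print(selenium_cookies_to_scrapy(sample))
```

In the spider, the result would be passed as cookies=selenium_cookies_to_scrapy(driver.get_cookies()) in place of the raw list.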
