gog*_*urt 12 python splash-screen scrapy
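This is a follow-up to a question I asked earlier.
I am trying to scrape a webpage that I have to log in to first. After authentication, though, the page I need requires some JavaScript to run before its content can be viewed. What I did was follow the instructions here to install Splash to try to render the JavaScript. However...
Before I switched to Splash, authenticating with Scrapy's InitSpider worked fine. I was getting through the login page and scraping the target page OK (except, obviously, without the JavaScript working). But once I add the code to pass the requests through Splash, it looks like I am no longer parsing the target page.
The spider is below. The only difference between the Splash version (here) and the non-Splash version is the function def start_requests(). Everything else is the same between the two.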
import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
        "http://www.bridgebase.com/myhands/index.php"
    ]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    # authentication
    def init_request(self):
        return scrapy.http.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.http.FormRequest.from_response(
            response,
            formdata={'username': 'USERNAME', 'password': 'PASSWORD'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
            return self.initialized()
        else:
            self.log("Login failed")
            print(response.body)

    # pipe the requests through splash so the JS renders
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    # what to do when a link is encountered
    rules = (
        Rule(LinkExtractor(), callback='parse_item'),
    )

    # do nothing on new link for now
    def parse_item(self, response):
        pass

    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
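What happens now is that test.html, the output of parse(), is simply the login page itself rather than the page I should be redirected to after logging in.
The log is telling: ordinarily I would see the "Login successful" line from check_login_response(), but as you can see below, it seems I am not even reaching that step. Is this because Scrapy is now sending the authentication requests through Splash too, and they are getting hung up there? If so, is there a way to bypass Splash for the authentication part only?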
2016-01-24 14:54:56 [scrapy] INFO: Spider opened
2016-01-24 14:54:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-24 14:54:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-24 14:55:02 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
2016-01-24 14:55:02 [scrapy] INFO: Closing spider (finished)
I am pretty sure I am not using Splash correctly. Can anyone point me to some documentation where I can figure out what is going on?
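I don't think Splash alone would handle this particular case well.
Here is the working idea:
use selenium and the PhantomJS headless browser to log in to the website, then pass the browser cookies from PhantomJS into Scrapy. The code: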
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        driver = webdriver.PhantomJS()
        driver.get(self.login_page)

        driver.find_element_by_id("username").send_keys("user")
        driver.find_element_by_id("password").send_keys("password")
        driver.find_element_by_name("submit").click()

        driver.save_screenshot("test.png")
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))

        cookies = driver.get_cookies()
        driver.close()

        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)

    def parse(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
        else:
            self.log("Login failed")
        print(response.body)
This prints Login successful and the HTML of the "hands" page.
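The cookie hand-off is the key step here, so it may help to see it in isolation. The following is only a sketch, not part of the spider above: `selenium_cookies_to_scrapy` is a hypothetical helper name, and the sample cookie values are made up. It assumes the browser returns cookies as a list of dicts (as `driver.get_cookies()` does) and keeps only the `name`, `value`, `domain` and `path` fields, which a Scrapy `Request` accepts in its `cookies` argument.

```python
def selenium_cookies_to_scrapy(selenium_cookies):
    """Keep only the cookie fields that are forwarded to Scrapy.

    Hypothetical helper: the name and the key selection are assumptions
    made for illustration, not part of Scrapy or Selenium.
    """
    allowed_keys = ("name", "value", "domain", "path")
    return [
        {key: cookie[key] for key in allowed_keys if key in cookie}
        for cookie in selenium_cookies
    ]


# Made-up sample in the shape of what driver.get_cookies() returns.
sample = [
    {
        "name": "PHPSESSID",
        "value": "abc123",
        "domain": "www.bridgebase.com",
        "path": "/",
        "httpOnly": False,
        "secure": False,
    }
]

scrapy_cookies = selenium_cookies_to_scrapy(sample)
print(scrapy_cookies)
```

In the spider this would be used as `cookies=selenium_cookies_to_scrapy(driver.get_cookies())`. Passing the raw list, as the code above does, also works; the helper just makes explicit which fields are carried over.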