J. *_*tra 5 python scrapy python-3.x scrapy-splash splash-js-render
I am trying to log in to a website using the following code (slightly modified for this post):
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess


class Login_me(scrapy.Spider):
    name = 'espn'
    allowed_domains = ['games.espn.com']
    start_urls = ['http://games.espn.com/ffl/leaguerosters?leagueId=774630']

    def start_requests(self):
        # Lua script executed by Splash: load the page, fill in the form, submit.
        script = """
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(10))

            local search_input = splash:select('input[type=email]')
            search_input:send_text("user email")
            local search_input = splash:select('input[type=password]')
            search_input:send_text("user password!")
            assert(splash:wait(10))
            local submit_button = splash:select('input[type=submit]')
            submit_button:click()
            assert(splash:wait(10))
            -- Return the rendered HTML to Scrapy.
            return {html = splash:html()}
        end
        """
        yield SplashRequest(
            'http://games.espn.com/ffl/leaguerosters?leagueId=774630',
            callback=self.after_login,
            endpoint='execute',
            args={'lua_source': script}
        )

    def after_login(self, response):
        table = response.xpath('//table[@id="playertable_0"]')
        for player in table.css('tr[id]'):
            item = {
                'id': player.css('::attr(id)').extract_first(),
            }
            yield item
            print(item)
I get this error:
<GET http://games.espn.com/ffl/signin?redir=http%3A%2F%2Fgames.espn.com%2Fffl%2Fleaguerosters%3FleagueId%3D774630> from <GET http://games.espn.com/ffl/leaguerosters?leagueId=774630>
2018-12-14 16:49:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://games.espn.com/ffl/signin?redir=http%3A%2F%2Fgames.espn.com%2Fffl%2Fleaguerosters%3FleagueId%3D774630> (referer: None)
2018-12-14 16:49:04 [scrapy.core.scraper] ERROR: Spider error processing <GET http://games.espn.com/ffl/signin?redir=http%3A%2F%2Fgames.espn.com%2Fffl%2Fleaguerosters%3FleagueId%3D774630> (referer: None)
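One way to read this log is to decode the `redir` query parameter of the crawled URL. The sketch below uses only the Python standard library and the URL copied from the log. It suggests that the page actually crawled was the sign-in page, with the roster page passed along as the redirect target, so the callback probably never saw the roster table.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied verbatim from the log output above.
logged_url = (
    "http://games.espn.com/ffl/signin"
    "?redir=http%3A%2F%2Fgames.espn.com%2Fffl%2Fleaguerosters%3FleagueId%3D774630"
)

parts = urlsplit(logged_url)
# parse_qs percent-decodes each value and returns a list per key.
redirect_target = parse_qs(parts.query)["redir"][0]

print(parts.path)       # path of the page that was actually crawled
print(redirect_target)  # page the spider originally asked for
```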
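For some reason I still can't log in. I have gone through many different posts here and tried many variations of "splash:select", but I can't find my problem. When I inspect the page in Chrome, I see this (the password field has similar HTML):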
<input type="email" placeholder="Username or Email Address" autocapitalize="none" autocomplete="on" autocorrect="off" spellcheck="false" ng-model="vm.username"
ng-pattern="/^[^<">]*$/" ng-required="true" did-disable-validate="" ng-focus="vm.resetUsername()" class="ng-pristine ng-invalid ng-invalid-required
ng-valid-pattern ng-touched" tabindex="0" required="required" aria-required="true" aria-invalid="true">
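I believe the HTML above is generated by JavaScript, so I can't scrape it with Scrapy alone. I looked at the page source, and I think the JS code relevant to using Splash is this (though I'm not sure):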
function authenticate(params) {
    return makeRequest('POST', '/guest/login', {
        'loginValue': params.loginValue,
        'password': params.password
    }, {
        'Authorization': params.authorization,
        'correlation-id': params.correlationId,
        'conversation-id': params.conversationId,
        'oneid-reporting': buildReportingHeader(params.reporting)
    }, {
        'langPref': getLangPref()
    });
}
Can someone point me in the right direction?
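The main problem here is that the login form is inside an iframe element. I don't know scrapy_splash, so the POC code below uses Selenium and Beautiful Soup. The mechanism should be similar with Splash, though: you need to switch into the iframe, then switch back to the main document once the iframe is gone.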
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

USER = 'theUser'
PASS = 'thePassword'

driver = webdriver.Firefox()
driver.get('http://games.espn.com/ffl/leaguerosters?leagueId=774630')

# The login form lives inside an iframe, so switch into it first.
iframe = driver.find_element(By.CSS_SELECTOR, 'iframe#disneyid-iframe')
driver.switch_to.frame(iframe)
driver.find_element(By.CSS_SELECTOR, "input[type='email']").send_keys(USER)
driver.find_element(By.CSS_SELECTOR, "input[type='password']").send_keys(PASS)
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

# Switch back to the main document before reading the page source.
driver.switch_to.default_content()
soup_level1 = BeautifulSoup(driver.page_source, 'html.parser')
For this code to work, you need Firefox and geckodriver installed in compatible versions, with geckodriver on your PATH.
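A quick way to check whether a login form is hidden inside an iframe is to scan the outer page source for `<iframe>` tags and for `<input>` tags. The sketch below uses only the standard library's `html.parser`. The `FrameFinder` class and the sample markup are hypothetical stand-ins, not the real ESPN page; only the `disneyid-iframe` id comes from the code above.

```python
from html.parser import HTMLParser


class FrameFinder(HTMLParser):
    """Record iframe ids and input types seen in the outer document."""

    def __init__(self):
        super().__init__()
        self.iframe_ids = []
        self.input_types = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "iframe":
            self.iframe_ids.append(attr_map.get("id"))
        elif tag == "input":
            self.input_types.append(attr_map.get("type"))


# Hypothetical outer page: the form fields are absent here because they
# would be loaded inside the iframe's own document.
sample_outer_html = """
<html><body>
  <div id="content"></div>
  <iframe id="disneyid-iframe" src="about:blank"></iframe>
</body></html>
"""

finder = FrameFinder()
finder.feed(sample_outer_html)
print(finder.iframe_ids)   # ids of iframes found in the outer document
print(finder.input_types)  # input fields visible without entering a frame
```

If the outer document lists an iframe but no email or password inputs, a selector such as `input[type=email]` run against the top-level page would likely find nothing. With the Selenium session above, `driver.page_source` could be fed to the parser in place of the sample string.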