cbp*_*bp2 18 python selenium phantomjs selenium-webdriver
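I'm trying to screen-scrape a website without having to launch an actual browser instance from a Python script (using Selenium). I can do this with Chrome or Firefox (I've tried it and it works), but I want to use PhantomJS so that it's headless.
The code looks like this: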
import sys
import traceback
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)
try:
    # Choose our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #browser = webdriver.PhantomJS()
    #browser = webdriver.Firefox()
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
    # Go to the login page
    browser.get("https://www.whatever.com")
    # For debug, see what we got back
    html_source = browser.page_source
    with open('out.html', 'w') as f:
        f.write(html_source)
    # PROCESS THE PAGE (code removed)
except Exception, e:
    browser.save_screenshot('screenshot.png')
    traceback.print_exc(file=sys.stdout)
finally:
    browser.close()
The output is just:
<html><head></head><body></body></html>
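But when I use the Chrome or Firefox options, it works fine. I thought maybe the website was returning junk based on the user agent, so I tried spoofing it. No difference.
What am I missing?
UPDATE: I'll try to keep the snippet below updated until it works. Here is what I'm currently trying.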
import sys
import traceback
import time
import re
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87")
try:
    # Set up our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
    # Go to the login page
    print "getting web page..."
    browser.get("https://www.website.com")
    # Need to wait for the page to load
    timeout = 10
    print "waiting %s seconds..." % timeout
    wait = WebDriverWait(browser, timeout)
    element = wait.until(EC.element_to_be_clickable((By.ID,'the_id')))
    print "done waiting. Response:"
    # Rest of code snipped. Fails as "wait" above.
Rau*_*har 29
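I had the same problem, and no amount of code to make the driver wait helped.
The problem was the SSL encryption on https websites; ignoring SSL errors does the trick.
Call the PhantomJS driver as: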
driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])
This solved the problem for me.
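If the flags need to be toggled between environments, a small helper can build the `service_args` list. This is only a sketch: `phantomjs_service_args` and `make_driver` are hypothetical names, not part of Selenium, and `make_driver` assumes a Selenium release that still ships `webdriver.PhantomJS` (newer releases no longer include it).

```python
def phantomjs_service_args(ignore_ssl_errors=True, ssl_protocol="TLSv1"):
    """Build the command-line flags handed to the PhantomJS binary (hypothetical helper)."""
    args = []
    if ignore_ssl_errors:
        args.append("--ignore-ssl-errors=true")
    if ssl_protocol:
        args.append("--ssl-protocol=" + ssl_protocol)
    return args


def make_driver(**kwargs):
    """Create a PhantomJS driver using the flags above (hypothetical helper)."""
    # Imported lazily so the flag builder can be used without Selenium installed.
    from selenium import webdriver
    return webdriver.PhantomJS(service_args=phantomjs_service_args(**kwargs))


print(phantomjs_service_args())
# prints ['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1']
```

Passing `ssl_protocol=None` omits the protocol flag, which leaves only the `--ignore-ssl-errors=true` variant used elsewhere on this page.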
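You need to wait for the page to load. Usually this is done with an explicit wait that waits for a key element to be present or visible on the page. For example: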
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# ...
browser.get("https://www.whatever.com")
wait = WebDriverWait(browser, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.content")))
html_source = browser.page_source
# ...
Here we wait up to 10 seconds for the div element with class="content" to become visible before getting the page source.
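To make the idea of an explicit wait concrete, here is a rough stand-alone sketch of the polling loop behind it. It is not Selenium's implementation: `wait_until` and `content_is_ready` are hypothetical names, and the fake condition stands in for a check such as "is `div.content` visible yet?".

```python
import time


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll_frequency)


# Stand-in for a page whose content only shows up on the third check.
checks = {"count": 0}


def content_is_ready():
    checks["count"] += 1
    return "div.content" if checks["count"] >= 3 else None


print(wait_until(content_is_ready, timeout=2.0, poll_frequency=0.05))
# prints div.content
```

The point of the pattern is that the script continues as soon as the condition holds instead of always sleeping for the full timeout, and it fails loudly if the element never shows up.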
Also, you may need to ignore SSL errors:
browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
That said, I'm fairly sure this is related to an issue in PhantomJS itself. There is an open ticket for it in the phantomjs bug tracker.