
Getting fully rendered HTML using Selenium WebDriver and Python

I'm trying to build a web scraper in Python using Selenium WebDriver, but when I retrieve the site's source from the webdriver, the information I need isn't there.

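I think the problem is that content is added to the page by JavaScript after the page has initially been downloaded from the server. When I call browser.page_source, I get the page source from before this content is added. I'd like to know whether I can get the page source after the JavaScript-loaded content has been added (in other words, what I see when I view the page with Inspect Element).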

Here is the basic Python script I'm using:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://www.opportunities.auckland.ac.nz")
print(browser.page_source)


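When I run the script above, I get the same source I see with View Page Source in the browser (i.e., none of the additional content that is visible with Inspect Element).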

Things I have tried

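Adding time.sleep(10) in various places, in case the page had not fully loaded when I accessed the source.
Using get_attribute("innerHTML") on the body.
Using execute_script() to get the JS to run.
Using execute_script() to get the JS scripts to run one after another.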
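
For reference, the first and third attempts above can be combined into one rough sketch. This is not the asker's actual code: `rendered_html`, `timeout` and `poll` are hypothetical names introduced for illustration. It assumes a Selenium WebDriver is passed in, swaps the fixed sleep for polling `document.readyState`, and then reads the live DOM through `execute_script()`. It only looks at the top-level document, so content that lives elsewhere (for example inside a frame) would still be missing.

```python
import time


def rendered_html(driver, timeout=10.0, poll=0.5):
    """Return the current DOM as HTML, waiting for the initial load first.

    Hypothetical helper: `driver` is assumed to be a Selenium WebDriver
    that has already been pointed at a page with driver.get(...).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # readyState is "complete" once the initial page load has finished.
        if driver.execute_script("return document.readyState") == "complete":
            break
        time.sleep(poll)
    # outerHTML serializes the DOM as it is now, including nodes added by
    # JavaScript to the top-level document after the initial download.
    return driver.execute_script("return document.documentElement.outerHTML")


# Possible usage with the script from the question:
#   from selenium import webdriver
#   browser = webdriver.Chrome()
#   browser.get("https://www.opportunities.auckland.ac.nz")
#   print(rendered_html(browser))
```

Note that `readyState` reaching "complete" does not guarantee that later asynchronous requests have finished, so this remains a best-effort check.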

It would be great if someone could tell me whether this is possible at all and, if so, point me in the right direction. Thanks.

Update 1

When I try Piotrek's solution, I get the following output:

Warning (from warnings module):
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/phantomjs/webdriver.py", line 49
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
<html><head></head><body></body></html>


Unfortunately, this doesn't seem to work.
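
One possibility, suggested only by the `iframe` tag on this post and not confirmed anywhere in it, is that the missing content is rendered inside an iframe, which `page_source` on the top-level document does not include. Under that assumption, a minimal sketch could switch into each frame and read its DOM separately. `html_of_each_frame` is a hypothetical helper name, and `driver` is assumed to be a Selenium WebDriver that has already loaded the page.

```python
def html_of_each_frame(driver):
    """Collect the rendered HTML of every iframe in the top-level document.

    Hypothetical helper, written under the unconfirmed assumption that
    the content of interest is inside an <iframe>.
    """
    pages = []
    # "tag name" is the locator strategy string behind selenium's By.TAG_NAME.
    frames = driver.find_elements("tag name", "iframe")
    for frame in frames:
        # Make the frame's document the target of subsequent commands.
        driver.switch_to.frame(frame)
        try:
            pages.append(
                driver.execute_script("return document.documentElement.outerHTML")
            )
        finally:
            # Go back to the top-level document before handling the next frame.
            driver.switch_to.default_content()
    return pages
```

If the frames themselves are added by JavaScript, the list may be empty until they exist, so some form of waiting before the call would still be needed.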

javascript python iframe selenium webdriverwait

Rob*_*xon

2018-09-20

Score: 5

Answers: 1

Views: 3309