Selenium Python - Get a list of all loaded URLs (images, scripts, stylesheets, etc.)

vat*_*sug 6 python selenium selenium-chromedriver selenium-webdriver

When Google Chrome loads a web page through Selenium, it may also load other files that the page needs, for example those referenced by <img src="example.com/a.png"> or <script src="example.com/a.js"> tags, as well as CSS files.

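How can I get a list of all URLs that the browser downloads while loading a page? (Programmatically, in Python, using Selenium and chromedriver.) In other words, I want the list of files shown in the "Network" tab of Chrome's Developer Tools.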

Sample code using Selenium and chromedriver:

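from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.binary_location = "/usr/bin/x-www-browser"
# Selenium 4 takes the driver path through a Service object and the options via options=
driver = webdriver.Chrome(service=Service("./chromedriver"), options=options)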
# Load some page
driver.get("https://example.com")
# Now, how do I see a list of downloaded URLs that took place when loading the page above?

vat*_*sug 3

Following up on @GPT14's suggestion in his answer, I wrote a small script that does exactly what I wanted: it prints the list of URLs loaded by a given page.

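It uses BrowserMob Proxy. Many thanks to @GPT14 for suggesting it; it suits our purpose very well. I changed the code from his answer and adapted it to the Google Chrome webdriver instead of Firefox. I also extended the script so that it walks the HAR JSON output and lists all request URLs. Remember to adjust the options below to your needs.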

from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Purpose of this script: List all resources (URLs) that
# Chrome downloads when visiting some page.

### OPTIONS ###
url = "https://example.com"
chromedriver_location = "./chromedriver"  # Path to the chromedriver binary
browsermobproxy_location = "/opt/browsermob-proxy-2.1.4/bin/browsermob-proxy"  # Location of the browsermob-proxy binary (it starts a server)
chrome_location = "/usr/bin/x-www-browser"
###############

# Start browsermob proxy
server = Server(browsermobproxy_location)
server.start()
proxy = server.create_proxy()

# Set up Chrome webdriver - note: does not seem to work with headless on
options = webdriver.ChromeOptions()
options.binary_location = chrome_location
# Point Chrome at our browsermob proxy so that it can track requests
options.add_argument('--proxy-server=%s' % proxy.proxy)
# Selenium 4 takes the driver path through a Service object and the options via options=
driver = webdriver.Chrome(service=Service(chromedriver_location), options=options)

# Now load some page
proxy.new_har("Example")
driver.get(url)

# Print all URLs that were requested
entries = proxy.har['log']['entries']
for entry in entries:
    if 'request' in entry:
        print(entry['request']['url'])

server.stop()
driver.quit()
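Since the question asks about images, scripts, stylesheets and so on, it can help to group the URLs by type instead of printing a flat list. The following is only a sketch of how the HAR walk could be extended: it assumes the proxy recorded a MIME type under response.content.mimeType (the field defined by the HAR format), and both the helper name urls_by_mime_type and the sample_har data are made up for illustration. It runs on its own, without a browser or proxy.

```python
from collections import defaultdict


def urls_by_mime_type(har):
    """Group the request URLs of a HAR dict by response MIME type."""
    grouped = defaultdict(list)
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url")
        if not url:
            continue  # skip entries that carry no request URL
        # HAR keeps the MIME type under response.content.mimeType;
        # strip parameters such as "; charset=UTF-8".
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        mime = mime.split(";")[0].strip() or "unknown"
        if url not in grouped[mime]:  # de-duplicate, keep first-seen order
            grouped[mime].append(url)
    return dict(grouped)


# Hypothetical data in HAR shape, standing in for proxy.har
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/"},
             "response": {"content": {"mimeType": "text/html; charset=UTF-8"}}},
            {"request": {"url": "https://example.com/a.png"},
             "response": {"content": {"mimeType": "image/png"}}},
            {"request": {"url": "https://example.com/a.js"},
             "response": {"content": {"mimeType": "application/javascript"}}},
            # Same image requested twice: listed only once
            {"request": {"url": "https://example.com/a.png"},
             "response": {"content": {"mimeType": "image/png"}}},
            # No response recorded: filed under "unknown"
            {"request": {"url": "https://example.com/pending"}},
        ]
    }
}

for mime, urls in urls_by_mime_type(sample_har).items():
    print(mime)
    for u in urls:
        print("    " + u)
```

In the script above, the same helper could be called as urls_by_mime_type(proxy.har) right after driver.get(url), before the proxy server is stopped.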