Python Selenium: get the "Developer Tools" → Network → Media logs

Question

8 python firefox selenium python-3.x

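I am trying to do something programmatically that requires getting the "Developer Tools" → Network → Media logs.

I'll spare you the details. Long story short, I need to visit thousands of pages like this one: https://music.163.com/#/song?id=ID, where ID, the part after the equals sign, is a number.

If you open such a page, there is a play button that triggers some JavaScript, which loads a music file that is not referenced anywhere in the page and plays it.
(Note: you may need a Chinese IP address to listen to some songs, and a VIP account to listen to some others.)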

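For example, take this page: https://music.163.com/#/song?id=32477986.

If you click the blue play button, the JavaScript is triggered, and the music file is loaded and played by that script. The music file never becomes an element of the web page, so it cannot be scraped directly with the find_element* methods.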

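But I found a way to get the address of the music file.

In Firefox, press F12 to open the Inspector ("Developer Tools"), click Network, and then click Media. Click the blue button, and several requests with the same filename show up. The filename matches ^[0-9a-f]+\.m4a, and the domains may differ.

Click any of these records to see its address. Any one of them will do.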
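
The filename pattern above can also be checked in code. This is only a minimal sketch, assuming the request URL is already available as a string; `is_music_file` and the URLs used to try it are hypothetical names invented for illustration, and the trailing `$` is an added assumption that is not part of the pattern quoted above.

```python
import re
from urllib.parse import urlsplit

# Pattern from the text: a lowercase hex name followed by ".m4a".
# The trailing "$" is an added assumption so that names such as
# "abc.m4a.tmp" are rejected.
M4A_NAME = re.compile(r"^[0-9a-f]+\.m4a$")


def is_music_file(url: str) -> bool:
    """Return True if the last path segment of url looks like <hex>.m4a."""
    # urlsplit drops the query string, so "?x=1" does not affect the check.
    filename = urlsplit(url).path.rsplit("/", 1)[-1]
    return M4A_NAME.match(filename) is not None
```

Because the domain is ignored, the same check applies to every mirror that serves the file.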

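I am now trying to figure out how to simulate this process programmatically.

I Googled this: python selenium developer tools network tab, and did not find what I was looking for, which is exactly what I expected. I mention the search to show my research effort, and to show how Google fails to understand the meaning of what you are searching for.

Anyway, I stumbled upon this:
https://www.rkengler.com/how-to-capture-network-traffic-when-scraping-with-selenium-and-python/

and tested it with this: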

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Ask chromedriver to record the DevTools "performance" log.
    capabilities = DesiredCapabilities.CHROME
    capabilities["goog:loggingPrefs"] = {'performance': "ALL"}

    driver = webdriver.Chrome(desired_capabilities=capabilities)
    wait = WebDriverWait(driver, 15)
    driver.get('https://music.163.com/#/song?id=32477986')

    # The play button lives inside the g_iframe frame.
    iframe = driver.find_element_by_xpath('//iframe[@id="g_iframe"]')
    driver.switch_to.frame(iframe)
    wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
    play = driver.find_element_by_xpath('//div[2]/div/a[1]')
    play.click()

    # Give the player time to request the music file, then read the log.
    time.sleep(10)
    driver.get_log('performance')

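It works, but the output is too broad, and I would prefer to use Firefox.

Then I tried to find all valid options for loggingPrefs by Googling: chrome all "loggingPrefs" options. Unfortunately, but unsurprisingly, I could not find anything other than browser:ALL and driver:ALL.

I cannot find any documentation that lists all the possible switches.

But I thought I might have spotted a pattern: Performance is one tab in the Inspector/DevTools, and Network is another.

So I replaced both occurrences of 'performance' with 'network' and ran the code again: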

    InvalidArgumentException: Message: invalid argument: log type 'network' not found
      (Session info: chrome=89.0.4389.90)
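
The error suggests that log types are not simply named after DevTools tabs. Selenium's Python bindings expose a `log_types` property on the driver that lists the types the current session accepts, so the name can be checked before calling `get_log`. The following is a sketch only: `safe_get_log` is a hypothetical helper, and which types actually appear depends on the browser, the driver, and the logging preferences in use.

```python
def safe_get_log(driver, log_type):
    """Fetch a log only if the session advertises that log type.

    driver is expected to be a Selenium WebDriver, or any object with a
    log_types attribute and a get_log method.
    """
    available = list(driver.log_types)
    if log_type not in available:
        # Fail early with the list of valid names instead of a driver error.
        raise ValueError(
            f"log type {log_type!r} not available; choose from {available}"
        )
    return driver.get_log(log_type)
```

Printing `driver.log_types` once in an interactive session is usually enough to see which names a given setup supports.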

That is what I got.

Anyway, this is what I put together:

    import os
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = Options()
    options.headless = True

    # Dedicated Firefox profile, with the audio volume scaled to zero.
    path = (os.environ['APPDATA'] + '\\Mozilla\\Firefox\\Profiles\\Selenium').replace('\\', '/')
    profile = webdriver.FirefoxProfile(path)
    profile.set_preference("media.volume_scale", "0.0")

    capabilities = DesiredCapabilities.FIREFOX
    capabilities["loggingPrefs"] = {'performance': 'ALL'}

    Firefox = webdriver.Firefox(firefox_profile=profile, desired_capabilities=capabilities, options=options)
    wait = WebDriverWait(Firefox, 15)
    Firefox.get('https://music.163.com/#/song?id=32477986')

    # The play button lives inside the g_iframe frame.
    iframe = Firefox.find_element_by_xpath('//iframe[@id="g_iframe"]')
    Firefox.switch_to.frame(iframe)
    wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
    play = Firefox.find_element_by_xpath('//div[2]/div/a[1]')
    play.click()
    time.sleep(10)
    Firefox.get_log('performance')

And this is how it fails:

    WebDriverException: Message: HTTP method not allowed

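For goodness' sake, how do I get the Network → Media logs with Python Selenium? I cannot even get the logging preferences to work. Everything I found uses the "loggingPrefs" key and, as you can see, it does not work. I vaguely remember something like gecko:loggingPrefs, but Googling "gecko:loggingPrefs" turns up nothing.

This comment: Getting console.log output from Firefox with Selenium mentions that driver.get_log('browser') no longer works. But it is unclear whether that applies only to browser or to all logs.

How can I get the Firefox Inspector logs, and how can I narrow them down to the Network → Media tab?

I am truly sorry if I have not shown enough research effort, but how on earth am I supposed to research something online without using Google? Haven't you learned from your own experience that Google never understands the meaning of your search terms? It only looks for documents that contain the keywords, scattered anywhere in the document, and the results do not even need to contain all of the keywords!

Google really is a terrible research tool, but I have nothing better. So if this is not enough, I don't know what would count as enough research effort.

So how do I get the Inspector → Network → Media logs in Firefox using Selenium with Python 3.9.5?

Google led me here and, frankly, the on-site search engine is even worse than Google. I could not find the answer I was looking for, which is exactly why I am asking here.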

After more research, I finally found something:
/sf/answers/4587699791/

That answer brought me one step closer to my goal, but I know nothing about JavaScript, and my test returned:

    JavascriptException: Message: Cyclic object value

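But it does point in the right direction: the solution should involve .execute_script() to get the job done, but I don't know what the command should be. I tried Googling: javascript get "devtools" "network" "media" "log"; see for yourself what it returns.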
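
One common cause of "Cyclic object value" is returning live browser objects from `execute_script`, which WebDriver then cannot serialize. Assuming that is what happened here, a possible workaround is to return plain strings only. The sketch below uses the standard Resource Timing API (`performance.getEntriesByType('resource')`); `resource_urls` and `RESOURCE_NAMES_JS` are hypothetical names, and whether the music request actually shows up in these entries on this site, and in which frame, has not been verified.

```python
# JavaScript that returns plain strings only, so WebDriver has nothing
# cyclic to serialize. It relies on the standard Resource Timing API.
RESOURCE_NAMES_JS = (
    "return window.performance.getEntriesByType('resource')"
    ".map(function (entry) { return entry.name; });"
)


def resource_urls(driver, suffix=".m4a"):
    """Return URLs of loaded resources whose path ends with suffix."""
    names = driver.execute_script(RESOURCE_NAMES_JS)
    # Ignore any query string when comparing the file extension.
    return [name for name in names if name.split("?", 1)[0].endswith(suffix)]
```

Resource timing entries belong to the window in which the script runs, so the result may depend on whether the driver is currently switched into the iframe or back at the top-level document.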

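Well, I managed to get the performance log using Chrome and redirect it to a text file, which I uploaded to Google Drive.

I have found the address in the file (by searching for .m4a in Notepad++), but I don't know how to filter the results programmatically down to the requests related to the music file.

I guess I am stuck with Chrome and the performance log for now.

But I really don't know how to filter the requests so that only the relevant ones remain. How can that be done?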

Answer 1

小智 9

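In the end I worked this out myself, without anyone's help.

The trick is simple, and once you know what to do it is not hard to implement.

The log entries are in JSON format, so we need the json module.

The structure of the entries varies, but the first-level keys are fixed. There are always three of them: level, message, timestamp.

We need the message key. Its value is a JSON object packed into a string, so we need json.loads to unpack it.

The structure of these unpacked JSON objects varies widely, but each one always has a message key, and inside that message there is always a method key.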
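
The two-level structure described above can be illustrated with a single entry. This is a sketch built around a hypothetical, heavily trimmed entry written by hand; real entries carry many more fields, and the timestamp and params values here are placeholders.

```python
import json

# Hypothetical, heavily trimmed entry in the shape described above.
# Note that the value of "message" is a string containing JSON, not a dict.
entry = {
    "level": "INFO",
    "timestamp": 1616000000000,
    "message": json.dumps(
        {"message": {"method": "Network.responseReceived", "params": {}}}
    ),
}

outer_keys = sorted(entry)             # the three fixed first-level keys
inner = json.loads(entry["message"])   # unpack the JSON string
method = inner["message"]["method"]    # the event name inside message
```

The name `message` therefore appears twice in any lookup: once as the key of the outer entry and once inside the decoded object.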

Here we are trying to grab the address of the media file that was received. Long story short, the message → message → method key should be equal to 'Network.responseReceived'.

If the message → message → method key is equal to 'Network.responseReceived', then there is always a message → message → params → response → mimeType key.

That key stores the MIME type of the resource. I won't go into detail; I know that MP4 stands for MPEG-4 (Moving Picture Experts Group) and is usually thought of as a video format, but here the media type should be 'audio/mp4'.

If all of these conditions are met, the address of the media file is the value of the message → message → params → response → url key.

Here is the final code:

    import json
    import os
    import random
    import sys
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Reuse the local Chrome user data directory (Windows path).
    path = (os.environ['LOCALAPPDATA'] + '\\Google\\Chrome\\User Data')

    options = webdriver.ChromeOptions()
    options.add_argument('--disable-gpu')
    options.add_argument('--headless')
    options.add_argument('--log-level=3')
    options.add_argument('--mute-audio')
    options.add_argument(f'--user-data-dir={path}')

    # Ask chromedriver to record the DevTools "performance" log.
    capabilities = DesiredCapabilities.CHROME
    capabilities["goog:loggingPrefs"] = {'performance': 'ALL'}

    Chrome = webdriver.Chrome(options=options, desired_capabilities=capabilities)
    wait = WebDriverWait(Chrome, 5)

    def getlink(addr):
        Chrome.get(addr)
        # The play button lives inside the g_iframe frame.
        iframe = Chrome.find_element_by_xpath('//iframe[@id="g_iframe"]')
        Chrome.switch_to.frame(iframe)
        wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
        play = Chrome.find_element_by_xpath('//div[2]/div/a[1]')
        play.click()
        time.sleep(5)
        logs = Chrome.get_log('performance')
        addresses = []
        for i in logs:
            # Each entry's 'message' value is itself a JSON string.
            log = json.loads(i['message'])
            if log['message']['method'] == 'Network.responseReceived':
                if log['message']['params']['response']['mimeType'] == 'audio/mp4':
                    addresses.append(log['message']['params']['response']['url'])
        # Accept the result only if every address points to the same filename;
        # otherwise the function falls through and returns None.
        check = {i.split('/')[-1] for i in addresses}
        if len(check) == 1:
            return random.choice(addresses)

    if __name__ == '__main__':
        print(getlink(sys.argv[1]))
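
The final code waits a fixed `time.sleep(5)` before reading the log. As a possible variation, not part of the original answer, the log can be polled until an audio response appears or a timeout expires. In the sketch below, `audio_urls` and `wait_for_audio_urls` are hypothetical helpers; the filter applies the same conditions as the answer, and the loop assumes that each `get_log('performance')` call returns only the entries recorded since the previous call.

```python
import json
import time


def audio_urls(entries, mime_type="audio/mp4"):
    """Collect response URLs with the given MIME type from log entries."""
    urls = []
    for entry in entries:
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        response = event.get("params", {}).get("response", {})
        if response.get("mimeType") == mime_type:
            urls.append(response["url"])
    return urls


def wait_for_audio_urls(driver, timeout=10.0, interval=0.5):
    """Poll the performance log until an audio response shows up.

    Returns the list of matching URLs, or an empty list on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Assumption: every call yields only new entries, so each batch
        # is inspected exactly once.
        found = audio_urls(driver.get_log("performance"))
        if found:
            return found
        if time.monotonic() >= deadline:
            return []
        time.sleep(interval)
```

Inside `getlink`, a call such as `wait_for_audio_urls(Chrome, timeout=5)` could stand in for the sleep and the filtering loop, which would return sooner for pages that load quickly when thousands of them are visited.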

Archived: 4 years, 4 months ago
Views: 4711
Last activity: 3 years, 9 months ago