Error when fetching a blob URL with Selenium

I am trying to get the contents of a blob held in memory, using Selenium in Python together with script injection.

Here is the code:

from selenium import webdriver
import base64
from bs4 import BeautifulSoup

def download_blob(driver, uri):
    result = driver.execute_async_script("""
        var uri = arguments[0];
        var callback = arguments[arguments.length-1];
        var toBase64 = function(buffer){for(var r,n=new Uint8Array(buffer),t=n.length,a=new Uint8Array(4*Math.ceil(t/3)),i=new Uint8Array(64),o=0,c=0;64>c;++c)i[c]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".charCodeAt(c);for(c=0;t-t%3>c;c+=3,o+=4)r=n[c]<<16|n[c+1]<<8|n[c+2],a[o]=i[r>>18],a[o+1]=i[r>>12&63],a[o+2]=i[r>>6&63],a[o+3]=i[63&r];return t%3===1?(r=n[t-1],a[o]=i[r>>2],a[o+1]=i[r<<4&63],a[o+2]=61,a[o+3]=61):t%3===2&&(r=(n[t-2]<<8)+n[t-1],a[o]=i[r>>10],a[o+1]=i[r>>4&63],a[o+2]=i[r<<2&63],a[o+3]=61),new TextDecoder("ascii").decode(a)};
        var xhr = new XMLHttpRequest();
        xhr.responseType = 'arraybuffer';
        xhr.onload = function(){ callback(toBase64(xhr.response)) };
        xhr.onerror = function(){ callback(xhr.status) };
        xhr.open('GET', uri);
        xhr.send();
        """, uri)
    print(uri, result)

    if isinstance(result, int):
        raise Exception("Request failed with status %s" % result)

    return base64.b64decode(result)

options = webdriver.ChromeOptions()
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36')
driver = webdriver.Chrome(options=options)
url = 'https://www.youtube.com/watch?v=KBtk5FUeJbk'
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html5lib')
blob_url = soup.find('video', attrs={'class': 'video-stream html5-main-video'})['src']
byte_stream = download_blob(driver, blob_url)


Output:

blob:https://www.youtube.com/5e3f1fab-3839-45a1-bb62-3582635b9e7d 0

Traceback (most recent call last):
  File "C:\Users\*****\Desktop\blob-download.py", line 32, in <module>
    byte_stream = download_blob(driver, blob_url)
  File "C:\Users\*****\Desktop\blob-download.py", line 20, in download_blob
    raise Exception("Request failed with status %s" % result)
Exception: Request failed with status 0


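The result variable comes back as the integer 0, which means the request failed. I don't understand what went wrong. At least some part of the in-memory blob should come back as bytes.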
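For context, XMLHttpRequest reports a status of 0 when a request fails before any HTTP response arrives, so the 0 above is not an HTTP status code. Below is a minimal sketch of how the two kinds of value handed back by the injected script could be told apart. The helper name decode_script_result is hypothetical and not part of Selenium.

```python
import base64


def decode_script_result(result):
    """Convert the value returned by the injected script into bytes.

    The script calls back with a base64 string on success and with
    xhr.status (an integer) on error. A status of 0 means the request
    never produced an HTTP response at all.
    """
    if isinstance(result, int):
        raise RuntimeError("XHR failed with status %s" % result)
    return base64.b64decode(result)
```

Calling decode_script_result(0) raises, which mirrors the traceback shown in the output.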

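I took the code above from the question "How to download an image with Python 3/Selenium if the URL begins with "blob:"?". The answer there mentions that the blob URL has to be fetched from the page that created it, so I scrape the blob URL with BeautifulSoup instead of hard-coding it. For example, a hard-coded call like this: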

byte_stream = download_blob(driver, 'blob:https://www.youtube.com/5e3f1fab-3839-45a1-bb62-3582635b9e7d') # this would definitely not work
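
The point about the owning page can be illustrated with a small origin check. This is only a sketch: origin_of, blob_origin and same_origin are hypothetical helper names, and it assumes the blob URL has the blob:<origin>/<uuid> shape seen in the output above.

```python
from urllib.parse import urlsplit


def origin_of(url):
    """Return the scheme://host[:port] part of an ordinary URL."""
    parts = urlsplit(url)
    return "%s://%s" % (parts.scheme, parts.netloc)


def blob_origin(blob_url):
    """Return the origin embedded in a blob: URL."""
    prefix = "blob:"
    if not blob_url.startswith(prefix):
        raise ValueError("not a blob URL: %r" % blob_url)
    # Everything after "blob:" is an ordinary URL whose path is the UUID.
    return origin_of(blob_url[len(prefix):])


def same_origin(blob_url, page_url):
    """True if the blob URL was minted under the page's origin."""
    return blob_origin(blob_url) == origin_of(page_url)
```

A matching origin is necessary but not sufficient, since a blob URL stops resolving once the document that created it is gone. The check can only rule out URLs that could never work on the current page.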


I even tried other websites, since I thought YouTube might have strict policies against scraping, but still no luck. Every other site gave the same response.

Insights into JavaScript alternatives are also welcome.
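
As one possible JavaScript alternative, the injected script could use fetch and FileReader instead of XMLHttpRequest and a hand-written base64 encoder. The following is an untested sketch, not a confirmed fix: it assumes the blob URL is backed by a real Blob that the page can still read, and the names FETCH_BLOB_JS, decode_data_url and download_blob_via_fetch are hypothetical.

```python
import base64

# Injected script: fetch the blob, let FileReader produce a base64 data URL,
# and hand it (or an error object) to Selenium's async callback.
FETCH_BLOB_JS = """
const uri = arguments[0];
const done = arguments[arguments.length - 1];
fetch(uri)
  .then(resp => resp.blob())
  .then(blob => {
    const reader = new FileReader();
    reader.onloadend = () => done(reader.result);
    reader.readAsDataURL(blob);
  })
  .catch(err => done({error: String(err)}));
"""


def decode_data_url(data_url):
    """Extract the raw bytes from a base64 data: URL."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    return base64.b64decode(payload)


def download_blob_via_fetch(driver, uri):
    """Run the script in the page owned by driver and return the blob bytes."""
    result = driver.execute_async_script(FETCH_BLOB_JS, uri)
    if not isinstance(result, str):
        # Either {"error": ...} from the catch handler or null from FileReader.
        raise RuntimeError("could not read blob: %r" % (result,))
    return decode_data_url(result)
```

If the fetch is rejected, the error text from the browser is passed back rather than a bare 0, which may make the failure easier to diagnose.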
