Avy*_*Wam 6 python xmlhttprequest scrapy web-scraping playwright
具体来说,我想从这个网站提取视频:equidia。第一个问题是当我启动scrapy shell https://www.equidia.fr/courses/2022-07-31/R1/C1 -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36". 我检查view(response)并注意到视频的容器存在。我可以点击播放,但没有流量,因此无法提取视频。我想蜘蛛在检查过程中的能力不会比我在网络浏览器中的能力更强。
更新:感谢 palpitus_on_fire 我检查了更多XHR类型请求,而我只专注于media类型。
我感兴趣的 XHR 请求是:https://equidia-vodce-p-api.hexaglobe.net/video/info/20220731_35186008_00000HR/01NdvIX6xNbihxxc3U52RDBX9novEE47lA1ZmEDw/0b94eb69d04aa18ca33ba5e8767b1a85264c35b6acefc8bfbd7eabb5e259cf2c/1659574478
它返回一个响应 json 文件:
{
"mp4": {
"240": "https://private4-stream-ovp.vide.io/dl/0bee7ff9bce79419fb27bac012c5d090/62e904dc/equidia_2752987/videos/240p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-240p_full_mp4.mp4?v=0",
"360": "https://private3-stream-ovp.vide.io/dl/7c53dc70ae5cc46d7e2c9d43e3dfc685/62e904dc/equidia_2752987/videos/360p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-360p_full_mp4.mp4?v=0",
"480": "https://private3-stream-ovp.vide.io/dl/4261c5bd7b48595535710e1dc34e8f93/62e904dc/equidia_2752987/videos/480p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-480p_full_mp4.mp4?v=0",
"540": "https://private3-stream-ovp.vide.io/dl/d40592e39e27931f20245edde9f30152/62e904dc/equidia_2752987/videos/540p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-540p_full_mp4.mp4?v=0",
"720": "https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0"
}
"hls": "https://private3-stream-ovp.vide.io/dl/6abf08b12203c84390a01fc4de65a4d5/62e904dc/equidia_2752987/videos/hls/63/03/S506303/V556523/playlist.m3u8?v=1",
"master": "https://private8-stream-ovp.vide.io/dl/6830b3a5039eef57f844bb08cd252584/62e904dc/equidia_2752987/videos/master/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523.mp4?v=0"
}
Run Code Online (Sandbox Code Playgroud)
"720": "https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0"包含我想要的视频和分辨率。
现在主要有两个问题:
XHR请求并准确地获取以 开头的请求equidia-vodce-p-api.hexaglobe.net?流行的解决方案是使用像selenium或 这样的网络测试器playwright,但我都不知道。如果可能的话我想获得这种形式的代码:
class VideoPageScraper(scrapy.Spider):
def __init__(self):
self.start_urls = ['https://www.equidia.fr/courses/2022-07-31/R1/C1']
def parse(self, response):
# code to catch the XHR request desired
urlvideo = 'https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0'
yield scrapy.Request(url=urlvideo, callback=self.obtainvideo)
def obtainvideo(self, response):
with #open('/home/avy/binary.mp4', returns'wb') as wfile:
wfile.write(response.body)
Run Code Online (Sandbox Code Playgroud)
更新
可以生成 XHR 的 URL 吗?
查看 Firefox 的开发工具,我试图了解所需的 XHR 请求是如何完成的。
脚本main.d685b66f8277bab3376b.js生成用于生成 XHR url 的元素(据我所知,位于第 1761 行)。
Object { endpoint: "https://equidia-vodce-p-api.hexaglobe.net/video/info", publicKey: "01NdvIX6xNbihxxc3U52RDBX9novEE47lA1ZmEDw", hash: "c79502b086116f30262dfbf7675225dc1c8bfb4c144df09da9a34b9106fccd39", expiresAt: "1659608062" }
Run Code Online (Sandbox Code Playgroud)
然后https://www.equidia.fr/polyfills.de328f7d13aac150968d.js使用 XHR 的 url 启动 fecth。
您是否认为有一种方法可以根据上面所示的脚本获取生成的对象,以便发出所需的 XHR 请求?
附录与剧作家尝试(无头浏览器)
我找到了一个可以获取XHR我正在寻找的请求的库。playwright是一个网络测试器,它scrapy与scrapy-playwright.
受这些例子的启发,我尝试了以下代码:
from playwright.sync_api import sync_playwright
url = 'https://www.equidia.fr/courses/2022-07-31/R1/C1'
with sync_playwright() as p:
def handle_response(response):
# the endpoint we are insterested in
if ("equidia-vodce-p-api" in response.url):
item = response.json()['mp4']
print(item)
browser = p.chromium.launch()
page = browser.new_page()
page.on("response", handle_response)
page.goto(url, wait_until="networkidle", timeout=900000)
page.context.close()
browser.close()
Exception in callback SyncBase._sync.<locals>.callback(<Task finishe...t responses')>) at /home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py:104
handle: <Handle SyncBase._sync.<locals>.callback(<Task finishe...t responses')>) at /home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py:104>
Traceback (most recent call last):
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 105, in callback
g_self.switch()
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 124, in <lambda>
lambda params: self._on_response(
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 396, in _on_response
page.emit(Page.Events.Response, response)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 176, in emit
handled = self._call_handlers(event, args, kwargs)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 154, in _call_handlers
self._emit_run(f, args, kwargs)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/asyncio.py", line 50, in _emit_run
self.emit("error", exc)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 179, in emit
self._emit_handle_potential_error(event, args[0] if args else None)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 139, in _emit_handle_potential_error
raise error
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/asyncio.py", line 48, in _emit_run
coro: Any = f(*args, **kwargs)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_impl_to_api_mapping.py", line 88, in wrapper_func
return handler(
File "<input>", line 7, in handle_response
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 601, in json
self._sync("response.json", self._impl_obj.json())
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
return task.result()
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 362, in json
return json.loads(await self.text())
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 358, in text
content = await self.body()
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 354, in body
binary = await self._channel.send("body")
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
return await self.inner_send(method, params, False)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Response body is unavailable for redirect responses
{'240': 'https://private4-stream-ovp.vide.io/dl/ec808fddd1b10659cb7d3bfeb7cba591/62e9abc8/equidia_2752987/videos/240p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-240p_full_mp4.mp4?v=0', '360': 'https://private3-stream-ovp.vide.io/dl/7c5cc10f2bb466fd85d7109c698500d3/62e9abc8/equidia_2752987/videos/360p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-360p_full_mp4.mp4?v=0', '480': 'https://private3-stream-ovp.vide.io/dl/c65277c3e385fb825e0ea4f5c1447d00/62e9abc8/equidia_2752987/videos/480p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-480p_full_mp4.mp4?v=0', '540': 'https://private3-stream-ovp.vide.io/dl/036f88b63a1157ce51bf68a7e60791cc/62e9abc8/equidia_2752987/videos/540p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-540p_full_mp4.mp4?v=0', '720': 'https://private8-stream-ovp.vide.io/dl/459a1e569da63e69031d1ecbf1f87ed0/62e9abc8/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0'}
Run Code Online (Sandbox Code Playgroud)
我设法获得了我想要的作品,但有很多例外,我不知道出了什么问题。
小智 1
这是一个完全有效的解决方案,通过使用scrapy-playwright,我得到了以下问题61和用户配置文件@lime-n 的想法。
我们下载发送的请求并将其存储到使用工具和 的xhr字典中。我为此提供了集成工具。playwrightscrapy-playwrightplaywright_page_event_handlersplaywright
根据视频大小,下载需要几分钟。
主要.py
import scrapy
from playwright.async_api import Response as PlaywrightResponse
import jsonlines
import pandas as pd
import json
from video.items import videoItem
headers = {
'authority': 'api.equidia.fr',
'accept': 'application/json, text/plain, */*',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'content-type': 'application/json',
'origin': 'https://www.equidia.fr',
'referer': 'https://www.equidia.fr/',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"macOS"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-site',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
class videoDownloadSpider(scrapy.Spider):
name = 'video'
start_urls = ['https://www.equidia.fr/courses/2022-07-31/R1/C1']
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(
url,
#callback = self.parse,
meta = {
'playwright':True,
'playwright_page_event_handlers':{
"response": "handle_response",
}
}
)
async def handle_response(self, response: PlaywrightResponse) -> None:
"""
We can grab the post data with response.request.post - there are three different types for different needs.
The method below helps grab those resource types of 'xhr' and 'fetch' until I can work out how to only send these to the download request.
"""
self.logger.info(f'test the log of data: {response.request.resource_type, response.request.url, response.request.method}')
jl_file = "videos.jl"
data = {}
if response.request.resource_type == "xhr":
if response.request.method == "GET":
if 'videos' in response.request.url:
data['resource_type']=response.request.resource_type,
data['request_url']=response.request.url,
data['method']=response.request.method
with jsonlines.open(jl_file, mode='a') as writer:
writer.write(data)
def parse(self, response):
video = pd.read_json('videos.jl', lines=True)
print('KEST: %s' % video['request_url'][0][0])
yield scrapy.FormRequest(
url = video['request_url'][0][0],
headers=headers,
callback = self.parse_video_json
)
def parse_video_json(self, response):
another_url = response.json().get('video_url')
yield scrapy.FormRequest(
another_url,
headers=headers,
callback=self.extract_videos
)
def extract_videos(self, response):
videos = response.json().get('mp4')
for keys, vids in videos.items():
loader = videoItem()
loader['video'] = [vids]
yield loader
Run Code Online (Sandbox Code Playgroud)
项目.py
class videoItem(scrapy.Item):
video = scrapy.Field()
Run Code Online (Sandbox Code Playgroud)
管道.py
class downloadFilesPipeline(FilesPipeline):
def file_path(self, request, response=None, info=None, item=None):
file = request.url.split('/')[-1]
video_file = f"{file}.mp4"
return video_file
Run Code Online (Sandbox Code Playgroud)
设置.py
from pathlib import Path
import os
BASE_DIR = Path(__file__).resolve().parent.parent
FILES_STORE = os.path.join(BASE_DIR, 'videos')
FILES_URLS_FIELD = 'video'
FILES_RESULT_FIELD = 'results'
DOWNLOAD_HANDLERS = {
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
ITEM_PIPELINES = {
'video.pipelines.downloadFilesPipeline': 150,
}
Run Code Online (Sandbox Code Playgroud)