M4c*_*k13 5 python beautifulsoup web-scraping python-requests
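I'm trying to scrape a YouTube page for its subtitles. Unfortunately, it doesn't load all of the content I need. I'd like to know where I'm going wrong.
Request URL: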
https://www.youtube.com/timedtext_editor?action_mde_edit_form=1&v=Nxb2s2Mv6Pw&lang=en&bl=vmp&forceedit=captions&tab=captions
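So I worked out that the only unique part of the URL is the video ID, Nxb2s2Mv6Pw, which I can swap out as needed.
If I run the code below, it doesn't capture the <textarea yt-uix-form-input-textarea ...> tag that I need it to find.
I'm desperately trying to avoid using Selenium for this, because I have a lot of links to iterate over and repeat the process for. As you can tell from the code below, I tried adding a time delay to wait for the page to load, but it made no difference.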
import os
import codecs
import sys
import time  # needed for the time.sleep() calls below

import requests
from bs4 import BeautifulSoup

channel = 'https://www.youtube.com/timedtext_editor?action_mde_edit_form=1&v=dto4koj5DTA&lang=en'
s = requests.Session()
time.sleep(5)
# s.headers['User-Agent'] = USER_AGENT
r = s.get(channel)
time.sleep(5)  # sleeping does not help: requests never runs the page's scripts
html = r.text
soup = BeautifulSoup(html, 'lxml')
for i in soup.find_all('div'):
    print(i)
Please advise.
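I tried scraping the page with requests and lxml, but when iterating over the tags I couldn't find the subtitles anywhere in the response (they are not present in the textarea tag of the returned markup). This is most likely because YouTube loads the subtitles with JavaScript.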
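To check this without a browser, a minimal sketch along the following lines tests whether a tag is present in the raw HTML that a plain HTTP client receives. It uses only the standard library's `html.parser`. The names `TagFinder` and `has_tag` and both sample HTML strings are made up for illustration; they are not YouTube's actual markup.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Records whether a given tag name appears in the parsed HTML."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag == self.tag_name:
            self.found = True


def has_tag(html, tag_name):
    """Return True if tag_name occurs as an element in the static HTML."""
    finder = TagFinder(tag_name.lower())
    finder.feed(html)
    return finder.found


# Hypothetical page as served: the textarea is only created later by a script.
static_html = (
    "<html><body><div id='editor'></div>"
    "<script>/* builds the textarea in the browser */</script></body></html>"
)
# Hypothetical page after a browser has run that script.
rendered_html = (
    "<html><body><div id='editor'>"
    "<textarea>caption text</textarea></div></body></html>"
)

print(has_tag(static_html, "textarea"))
print(has_tag(rendered_html, "textarea"))
```

In practice you would pass `r.text` from the requests call to `has_tag`. If it reports that a tag is missing while the browser's inspector shows it, the element is being added client-side and no amount of waiting in the script will make it appear.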
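Python's requests library does not execute JavaScript. But you have a few options:
Use Selenium to scrape the subtitles (you said you'd rather not).
Watch the POST and GET requests in your browser's network tab and try to send the required request parameters to the URL that the JavaScript calls (this may not always work if authentication or dynamic tokens are used in the parameters).
Use youtube-dl (this seems to be the simplest and most reliable method).
youtube-dl is a command-line utility, but you can also import it as a module, as described in the documentation on GitHub.
There are a couple of ways to go about this. I'll use the video you pointed to in your post as my example: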
youtube-dl --write-sub --skip-download --sub-lang en "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"
That said, you can write a Python function that calls the command:
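import subprocess

def download_subs(video_url, lang="en"):
    # Pass the arguments as a list so the shell never sees the URL;
    # characters such as "?" and "&" then need no quoting.
    cmd = [
        "youtube-dl",
        "--skip-download",
        "--write-sub",
        "--sub-lang",
        lang,
        video_url,
    ]
    subprocess.run(cmd, check=True)

url = "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"
download_subs(url)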
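Since you have many links to process, one possible way to go from the editor URLs in your question to youtube-dl invocations is sketched below. It only builds the argument lists and does not run anything. The helper names `video_id_from_url`, `watch_url`, and `build_command` are invented for this example, and it assumes the video ID is always carried in the `v` query parameter, as it is in the URLs you posted.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def video_id_from_url(url):
    """Return the value of the 'v' query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


def watch_url(video_id):
    """Build a standard watch URL for a video ID."""
    return "https://www.youtube.com/watch?" + urlencode({"v": video_id})


def build_command(video_id, lang="en"):
    """Return a youtube-dl argument list for one video ID."""
    return [
        "youtube-dl",
        "--skip-download",
        "--write-sub",
        "--sub-lang", lang,
        watch_url(video_id),
    ]


editor_urls = [
    "https://www.youtube.com/timedtext_editor?action_mde_edit_form=1&v=Nxb2s2Mv6Pw&lang=en",
    "https://www.youtube.com/timedtext_editor?action_mde_edit_form=1&v=dto4koj5DTA&lang=en",
]

commands = [build_command(video_id_from_url(u)) for u in editor_urls]
for cmd in commands:
    # Hand each list to subprocess.run(cmd) to actually download the subtitles.
    print(cmd)
```

Keeping the command construction separate from the download call makes it easy to inspect or log what will be run before touching the network.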
**Alternatively, you can import youtube_dl directly in Python and use it from there:**
import youtube_dl

def download_subs(url, lang="en"):
    opts = {
        "skip_download": True,
        # writesubtitles is a boolean flag, not a filename template.
        "writesubtitles": True,
        # subtitleslangs expects a list of language codes.
        "subtitleslangs": [lang],
    }
    with youtube_dl.YoutubeDL(opts) as yt:
        yt.download([url])

url = "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"
download_subs(url)
This creates a file in the working directory named
CNN 'Exposed' In Controversial Secret Video and Anita Sarkeesian's 'Punishment'...-Nxb2s2Mv6Pw.en.vtt
whose contents begin:
WEBVTT
Kind: captions
Language: en
00:00:00.000 --> 00:00:01.500
You beautiful bastards
00:00:01.500 --> 00:00:07.200
Hope you having a fantastic Tuesday welcome back to the Philip Defranco show and let's just jump into it the first thing
00:00:07.200 --> 00:00:11.519
I want to talk about today one of the most requested stories of the day today is an update on the
00:00:11.889 --> 00:00:13.650
Craziness out of Vidcon yesterday
00:00:13.650 --> 00:00:19.350
Specifically we're talking about creator and panelist Anita Sarkeesian being on a panel calling someone in the crowd
...
...
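Once you have the .vtt file, pulling out the caption text is plain string processing. The following is a rough sketch, not a full WebVTT parser: the function name `parse_vtt` is made up, it assumes timestamps always include the hours field as in the output above, and it ignores cue identifiers, NOTE blocks, and inline styling tags. The sample text is abridged from the file shown above.

```python
import re

# Matches a cue timing line such as "00:00:01.500 --> 00:00:07.200".
CUE_TIMING = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})"
)


def parse_vtt(text):
    """Return a list of (start, end, caption) tuples from WebVTT text."""
    cues = []
    current = None  # [start, end, text_lines] for the cue being read
    for line in text.splitlines():
        line = line.strip()
        match = CUE_TIMING.match(line)
        if match:
            # A new timing line closes the previous cue, if any.
            if current:
                cues.append((current[0], current[1], " ".join(current[2])))
            current = [match.group(1), match.group(2), []]
        elif current is not None and line:
            # Non-empty lines after a timing line belong to that cue.
            current[2].append(line)
        # Lines before the first timing line (the header) are skipped.
    if current:
        cues.append((current[0], current[1], " ".join(current[2])))
    return cues


sample = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:01.500
You beautiful bastards

00:00:11.889 --> 00:00:13.650
Craziness out of Vidcon yesterday
"""

for start, end, caption in parse_vtt(sample):
    print(start, end, caption)
```

To process a downloaded file, read it with `open(path, encoding="utf-8").read()` and pass the result to `parse_vtt`.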