使用 python 从单个 youtube 播放列表链接中提取单个链接

Question

使用 python 从单个 youtube 播放列表链接中提取单个链接

the*_*ate 6 python beautifulsoup youtube-api web-scraping python-3.x

我需要一个 python 脚本，它可以链接到单个 youtube 播放列表，然后给出一个列表，其中包含播放列表中各个视频的链接。

我意识到几年前也有人问过同样的问题，但有人问它 python2.x 并且答案中的代码无法正常工作。它们非常奇怪，它们有时会工作，但偶尔会输出空输出（也许那里使用的某些软件包已更新，我不知道）。我在下面包含了这些代码之一。

如果您不相信，请多次运行此代码，您偶尔会收到空列表，但大多数情况下它会分解播放列表。

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.youtube.com/playlist?list=PL3D7BFF1DDBDAAFE5')
page = r.text
soup=bs(page,'html.parser')
res=soup.find_all('a',{'class':'pl-video-title-link'})
for l in res:
    print(l.get("href"))

Run Code Online (Sandbox Code Playgroud)

在某些播放列表的情况下，代码根本不起作用。

此外，如果 beautifulsoup 不能完成这项工作，任何其他流行的 Python 库都可以。

Answer 1

Ber*_*tel 17

似乎 youtube 有时会加载不同版本的页面，有时会像您预期的那样使用带有pl-video-title-linkclass 的链接组织 html ：

<td class="pl-video-title">
   <a class="pl-video-title-link yt-uix-tile-link yt-uix-sessionlink  spf-link " dir="ltr" href="/watch?v=GtWXOzsD5Fw&amp;list=PL3D7BFF1DDBDAAFE5&amp;index=101&amp;t=0s" data-sessionlink="ei=TJbjXtC8NYri0wWCxarQDQ&amp;feature=plpp_video&amp;ved=CGoQxjQYYyITCNCSmqHD_OkCFQrxtAodgqIK2ij6LA">
   Android Application Development Tutorial - 105 - Spinners and ArrayAdapter
   </a>
   <div class="pl-video-owner">
      de <a href="/user/thenewboston" class=" yt-uix-sessionlink      spf-link " data-sessionlink="ei=TJbjXtC8NYri0wWCxarQDQ&amp;feature=playlist&amp;ved=CGoQxjQYYyITCNCSmqHD_OkCFQrxtAodgqIK2ij6LA" >thenewboston</a>
   </div>
   <div class="pl-video-bottom-standalone-badge">
   </div>
</td>

Run Code Online (Sandbox Code Playgroud)

有时将数据嵌入 JS 变量并动态加载：

window["ytInitialData"] = { .... very big json here .... };

Run Code Online (Sandbox Code Playgroud)

对于第二个版本，除非您想使用 selenium 之类的工具在页面加载后抓取内容，否则您将需要使用正则表达式来解析 Javascript。

最好的方法是使用官方 API 直接获取播放列表项：

转到Google Developer Console，搜索 Youtube Data API / 启用 Youtube Data API v3

点击创建凭证/Youtube Data API v3/公共数据
或者（对于凭据创建）转到凭据/创建凭据/API 密钥

为 python 安装 google api 客户端：
```
  pip3 install --upgrade google-api-python-client
```
Run Code Online (Sandbox Code Playgroud)

在下面的脚本中使用 API 密钥。此脚本使用 id 获取播放列表的播放列表项目PL3D7BFF1DDBDAAFE5，使用分页获取所有项目，并从 videoId 和 playlistID 重新创建链接：

import googleapiclient.discovery
from urllib.parse import parse_qs, urlparse

#extract playlist id from url
url = 'https://www.youtube.com/playlist?list=PL3D7BFF1DDBDAAFE5'
query = parse_qs(urlparse(url).query, keep_blank_values=True)
playlist_id = query["list"][0]

print(f'get all playlist items links from {playlist_id}')
youtube = googleapiclient.discovery.build("youtube", "v3", developerKey = "YOUR_API_KEY")

request = youtube.playlistItems().list(
    part = "snippet",
    playlistId = playlist_id,
    maxResults = 50
)
response = request.execute()

playlist_items = []
while request is not None:
    response = request.execute()
    playlist_items += response["items"]
    request = youtube.playlistItems().list_next(request, response)

print(f"total: {len(playlist_items)}")
print([ 
    f'https://www.youtube.com/watch?v={t["snippet"]["resourceId"]["videoId"]}&list={playlist_id}&t=0s'
    for t in playlist_items
])

Run Code Online (Sandbox Code Playgroud)

输出：

get all playlist items links from PL3D7BFF1DDBDAAFE5
total: 195
[
    'https://www.youtube.com/watch?v=SUOWNXGRc6g&list=PL3D7BFF1DDBDAAFE5&t=0s', 
    'https://www.youtube.com/watch?v=857zrsYZKGo&list=PL3D7BFF1DDBDAAFE5&t=0s', 
    'https://www.youtube.com/watch?v=Da1jlmwuW_w&list=PL3D7BFF1DDBDAAFE5&t=0s',
    ...........
    'https://www.youtube.com/watch?v=1j4prh3NAZE&list=PL3D7BFF1DDBDAAFE5&t=0s', 
    'https://www.youtube.com/watch?v=s9ryE6GwhmA&list=PL3D7BFF1DDBDAAFE5&t=0s'
]

Run Code Online (Sandbox Code Playgroud)

归档时间：	5 年，7 月前
查看次数：	10242 次
最近记录：	4 年，6 月前