it_*_*ure · 6 · Tags: youtube, selenium, google-chrome, python-3.x, shady-dom
The web page shows that there are 702 comments.
Target YouTube sample

I wrote a function get_total_youtube_comments(url); much of the code was copied from a project on GitHub.
def get_total_youtube_comments(url):
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    import time
    options = webdriver.ChromeOptions()
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options,executable_path='/usr/bin/chromedriver')
    wait = WebDriverWait(driver,60)
    driver.get(url)
    SCROLL_PAUSE_TIME = 2
    CYCLES = 7
    html = driver.find_element_by_tag_name('html')
    html.send_keys(Keys.PAGE_DOWN)
    html.send_keys(Keys.PAGE_DOWN)
    time.sleep(SCROLL_PAUSE_TIME * 3)
    for i in range(CYCLES):
        html.send_keys(Keys.END)
        time.sleep(SCROLL_PAUSE_TIME)
    comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
    all_comments = [elem.text for elem in comment_elems]
    return all_comments
I tried to parse all the comments on the sample page https://www.youtube.com/watch?v=N0lxfilGfak.
url='https://www.youtube.com/watch?v=N0lxfilGfak'
list = get_total_youtube_comments(url)
It gets some comments, but only a small fraction of all of them.
len(list)
60
60 is far fewer than 702. How can I get all the comments on a YouTube page with Selenium?
@supputuri, I can extract all the comments with your code.
comments_list = driver.find_elements_by_xpath("//*[@id='content-text']")
len(comments_list)
709
print(driver.find_element_by_xpath("//h2[@id='count']").text)
717 Comments
comments_list[-1].text
'mistake at 23:11 \nin NOT it should return false if x is true.'
comments_list[0].text
'Got a question on the topic? Please share it in the comment section below and our experts will answer it for you. For Edureka Python Course curriculum, Visit our Website: Use code "YOUTUBE20" to get Flat 20% off on this training.'
Why is the number of comments 709 rather than the 717 shown on the page?
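One way to quantify that gap is to parse the number out of the header text and compare it with the number of collected elements. The sketch below is only illustrative: the helper names `parse_comment_count` and `missing_comments` are hypothetical, and it assumes the header text starts with the count, as in the `717 Comments` output above. It says nothing about why the two numbers differ.

```python
import re


def parse_comment_count(header_text):
    """Extract the integer count from text such as '717 Comments'."""
    match = re.search(r"\d[\d,]*", header_text)
    if match is None:
        raise ValueError("no number found in header text: %r" % header_text)
    # Drop thousands separators such as the comma in '1,234'.
    return int(match.group().replace(",", ""))


def missing_comments(header_text, collected):
    """Return how many comments the header reports beyond those collected."""
    return parse_comment_count(header_text) - len(collected)
```

With the values from the session above, `missing_comments("717 Comments", comments_list)` would report a difference of 8 for a list of 709 elements.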
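You are getting a limited number of comments because YouTube loads comments progressively as you keep scrolling down. About 394 comments on that video are still unloaded at that point. You first have to make sure all the comments are loaded, and then also expand every View Replies link, in order to reach the maximum comment count.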
Note: I was able to get 700 comments using the following lines of code.
# get the last comment
lastEle = driver.find_element_by_xpath("(//*[@id='content-text'])[last()]")
# scroll to the last comment currently loaded
lastEle.location_once_scrolled_into_view
# wait until the comments loading is done
WebDriverWait(driver,30).until(EC.invisibility_of_element((By.CSS_SELECTOR,"div.active.style-scope.paper-spinner")))
# load all comments
while lastEle != driver.find_element_by_xpath("(//*[@id='content-text'])[last()]"):
    lastEle = driver.find_element_by_xpath("(//*[@id='content-text'])[last()]")
    driver.find_element_by_xpath("(//*[@id='content-text'])[last()]").location_once_scrolled_into_view
    time.sleep(2)
    WebDriverWait(driver,30).until(EC.invisibility_of_element((By.CSS_SELECTOR,"div.active.style-scope.paper-spinner")))
# open all replies
for reply in driver.find_elements_by_xpath("//*[@id='replies']//paper-button[@class='style-scope ytd-button-renderer'][contains(.,'View')]"):
    reply.location_once_scrolled_into_view
    driver.execute_script("arguments[0].click()",reply)
time.sleep(5)
WebDriverWait(driver, 30).until(
    EC.invisibility_of_element((By.CSS_SELECTOR, "div.active.style-scope.paper-spinner")))
# print the total number of comments
print(len(driver.find_elements_by_xpath("//*[@id='content-text']")))
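The load-by-scrolling idea above can also be written as a small reusable helper that stops once the comment count no longer grows. This is a minimal sketch, not the answer's exact method: the function name `load_all_comments` and its parameters are hypothetical, it assumes the Selenium 4 style `driver.find_elements("xpath", ...)` call (where `"xpath"` is the value of `By.XPATH`), and it replaces the spinner wait with a fixed pause plus a limit on consecutive rounds without new comments.

```python
import time

# Selector taken from the code in this thread.
COMMENT_XPATH = "//*[@id='content-text']"


def load_all_comments(driver, pause=2.0, max_idle_rounds=3, sleep=time.sleep):
    """Scroll to the last loaded comment until the count stops growing.

    Gives up after `max_idle_rounds` consecutive rounds without new comments.
    """
    count = 0
    idle = 0
    while idle < max_idle_rounds:
        elems = driver.find_elements("xpath", COMMENT_XPATH)
        if elems:
            # Bring the last loaded comment into view to trigger further loading.
            driver.execute_script("arguments[0].scrollIntoView(true);", elems[-1])
        sleep(pause)  # crude wait; an assumption standing in for the spinner wait
        new_count = len(driver.find_elements("xpath", COMMENT_XPATH))
        if new_count > count:
            count, idle = new_count, 0
        else:
            idle += 1
    return [elem.text for elem in driver.find_elements("xpath", COMMENT_XPATH)]
```

Calling `load_all_comments(driver)` only makes sense after `driver.get(url)` and an initial scroll that makes the comment section render. If no comments are present yet, the helper returns an empty list after the idle rounds run out. Replies would still need to be expanded separately, as in the code above.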
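A couple of things:
For the video at https://www.youtube.com/watch?v=N0lxfilGfak, the comments are not rendered unless the user scrolls a particular element on the page into the viewport.
The comments are within: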
<!--css-build:shady-->
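This marker indicates that Polymer CSS Builder is used to apply Polymer's CSS Mixin shim and ShadyDOM scoping, so some runtime work is still needed to convert CSS selectors under the default settings.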
Considering the factors above, here is a solution to retrieve all the comments:
Code block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException, WebDriverException
import time
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.youtube.com/watch?v=N0lxfilGfak')
driver.execute_script("return scrollBy(0, 400);")
subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
driver.execute_script("arguments[0].scrollIntoView(true);",subscribe)
comments = []
my_length = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//yt-formatted-string[@class='style-scope ytd-comment-renderer' and @id='content-text'][@slot='content']"))))
while True:
    try:
        driver.execute_script("window.scrollBy(0,800)")
        time.sleep(5)
        # each pass appends the full list of currently visible comment texts
        comments.append([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//yt-formatted-string[@class='style-scope ytd-comment-renderer' and @id='content-text'][@slot='content']")))])
    except TimeoutException:
        driver.quit()
        break
# the original printed the undefined name `comment`
print(comments)
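Because the loop above appends the complete list of visible comments on every pass, `comments` ends up as a list of lists in which earlier comments repeat many times. A minimal sketch of flattening that structure while keeping the first occurrence of each text is shown below. The helper name `flatten_unique` is hypothetical, and de-duplicating by text is an assumption: two different users posting identical text would be merged into one entry.

```python
def flatten_unique(batches):
    """Flatten a list of lists of strings, keeping first occurrences in order."""
    seen = set()
    unique = []
    for batch in batches:
        for text in batch:
            if text not in seen:
                seen.add(text)
                unique.append(text)
    return unique
```

For example, `flatten_unique(comments)` could be called on the list built by the loop before printing or counting the results.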