Scraping page links with Selenium always returns a limited number of links

kha*_*baa 4 python beautifulsoup web-scraping selenium-webdriver

我想从这个页面“https://m.aiscore.com/basketball/20210610”中抓取所有匹配链接,但只能得到限制数量的匹配:我试过这个代码:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless") 
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)

url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)

driver.maximize_window()
driver.implicitly_wait(60) 

driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")    

soup = BeautifulSoup(driver.page_source, 'html.parser')

links = [i['href'] for i in soup.select('.w100.flex a')]
links_length = len(links)  # always returns 16
driver.quit()

When I run the code, I always get only 16 match links, but the page has 35 matches. I need to get all the match links on the page.

Ram*_*Ram 5

Since the site loads more content as you scroll, I scrolled one screen at a time until the height we need to scroll to exceeds the page's total scroll height.

I used a set to store the match links, so links that were already collected are not added again.

When I ran this, I was able to find all the links. Hope this works for you too.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless") 
driver = webdriver.Chrome(executable_path=r"C:\Users\User\Downloads\chromedriver.exe", options=options)

url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)
# Wait till the webpage is loaded
time.sleep(2)

# wait for 1sec after scrolling
scroll_wait = 1

# Gets the screen height
screen_height = driver.execute_script("return window.screen.height;")
driver.implicitly_wait(60) 

# Number of scrolls. Initially 1
ScrollNumber = 1

# Set to store all the match links
ans = set()

while True:
    # Scroll one screen at a time until the bottom of the page is reached
    driver.execute_script(f"window.scrollTo(0, {screen_height * ScrollNumber})")
    ScrollNumber += 1
    
    # Wait for some time after scroll
    time.sleep(scroll_wait)
    
    # Updating the scroll_height after each scroll
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    
    # Fetching the data that we need - Links to Matches
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for j in soup.select('.w100 .flex a'):
        ans.add(j['href'])  # a set silently ignores duplicates
    # Break when the height we need to scroll to is larger than the scroll height
    if (screen_height) * ScrollNumber > scroll_height:
        break
    
    
print(f'Links found: {len(ans)}')
Output:

Links found: 61
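One note on the screen-counting loop above: it assumes `window.screen.height` matches the actual viewport. A variant (a sketch, not the answer's exact method) is to scroll to the current bottom and stop once `document.body.scrollHeight` stops growing. Here the browser calls are abstracted as plain callables so the loop logic can be shown on its own; `scroll_to_bottom` and `FakePage` are hypothetical names for illustration:

```python
def scroll_to_bottom(get_height, scroll_to, wait=lambda: None, max_rounds=100):
    """Scroll to the current page bottom until the height stops growing.

    get_height: returns document.body.scrollHeight (e.g. via execute_script)
    scroll_to:  scrolls the window to a given y offset
    wait:       pause to let lazy-loaded content appear
    """
    last_height = get_height()
    for _ in range(max_rounds):
        scroll_to(last_height)        # jump to the current bottom
        wait()                        # give the site time to load more rows
        new_height = get_height()
        if new_height == last_height:
            break                     # nothing new loaded, so we are done
        last_height = new_height
    return last_height


class FakePage:
    """Stand-in for a lazy-loading page: grows 1000px per scroll, up to 3000px."""
    def __init__(self):
        self.height = 1000

    def get_height(self):
        return self.height

    def scroll_to(self, y):
        if self.height < 3000:
            self.height += 1000


page = FakePage()
print(scroll_to_bottom(page.get_height, page.scroll_to))  # 3000
```

With Selenium you would pass `lambda: driver.execute_script("return document.body.scrollHeight;")` as `get_height`, `lambda y: driver.execute_script(f"window.scrollTo(0, {y})")` as `scroll_to`, and a `time.sleep` wrapper as `wait`.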