kha*_*baa 4 python beautifulsoup web-scraping selenium-webdriver
I want to scrape all the match links from this page, https://m.aiscore.com/basketball/20210610, but I only get a limited number of matches. I tried this code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)
url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)
driver.maximize_window()
driver.implicitly_wait(60)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
soup = BeautifulSoup(driver.page_source, 'html.parser')
links = [i['href'] for i in soup.select('.w100.flex a')]
links_length = len(links) #always return 16
driver.quit()
When I run the code, I always get only 16 match links, but the page has 35 matches. I need to get all the match links on the page.
Since the site loads more matches as you scroll, I scroll one screen at a time until the height we would scroll to exceeds the page's total scroll height.
I use a set to store the match links, so links that were already collected are not added again.
On running this, I was able to find all the links. Hope this works for you too.
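The stopping condition can be sketched on its own before the full Selenium version: keep scrolling one screen at a time and stop once the next scroll target would exceed the page's total scroll height. A pure-Python simulation with made-up heights standing in for the values the browser would return:

```python
# Hypothetical heights, standing in for the values Selenium would return
screen_height = 800    # window.screen.height
scroll_height = 3500   # document.body.scrollHeight (fixed here; it grows on the real page)

scroll_number = 1
while True:
    # the real code scrolls to screen_height * scroll_number here
    scroll_number += 1
    if screen_height * scroll_number > scroll_height:
        break

print(scroll_number)  # 5: targets 800, 1600, 2400, 3200 all fit under 3500, then 4000 stops the loop
```

On the real page `scroll_height` is re-read after every scroll, so the loop keeps going as lazy-loaded matches extend the page.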
import requests
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=r"C:\Users\User\Downloads\chromedriver.exe", options=options)
url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)
# Wait till the webpage is loaded
time.sleep(2)
# wait for 1sec after scrolling
scroll_wait = 1
# Gets the screen height
screen_height = driver.execute_script("return window.screen.height;")
driver.implicitly_wait(60)
# Number of scrolls. Initially 1
ScrollNumber = 1
# Set to store all the match links
ans = set()
while True:
    # Scroll one screen at a time
    driver.execute_script(f"window.scrollTo(0, {screen_height * ScrollNumber})")
    ScrollNumber += 1
    # Wait for the newly loaded matches to render
    time.sleep(scroll_wait)
    # Update the total scroll height after each scroll
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    # Fetch the data we need - links to matches
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for j in soup.select('.w100 .flex a'):
        ans.add(j['href'])  # a set silently ignores links it already holds
    # Break when the height we would scroll to exceeds the total scroll height
    if screen_height * ScrollNumber > scroll_height:
        break
driver.quit()
print(f'Links found: {len(ans)}')
Output:
Links found: 61
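The set-based deduplication used above can be seen in isolation: `set.add` is a no-op for members already present, so links re-collected from overlapping scroll positions are only counted once. A minimal sketch (the hrefs are made up):

```python
# Hypothetical hrefs collected from two overlapping "scrolls"
first_scroll = ["/match/1", "/match/2", "/match/3"]
second_scroll = ["/match/3", "/match/4"]   # "/match/3" appears again

ans = set()
for href in first_scroll + second_scroll:
    ans.add(href)  # duplicates are silently ignored

print(len(ans))  # 4 unique links
```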
Views: 223