Luc*_*zen 1 html javascript python beautifulsoup
I am using the following code to get all the <script>...</script> content from a web page (see the url in the code):
import urllib2
from bs4 import BeautifulSoup
import re
import imp
url = "http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
script = soup.find_all("script")
print script #just to check the output of script
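However, BeautifulSoup searches the page source (Ctrl + U in Chrome). I want BeautifulSoup to search the page's element code (Ctrl + Shift + I in Chrome) instead.
I want this because the code I am actually interested in is in the element code, not in the page source.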
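The first thing to understand is that neither BeautifulSoup nor urllib2 is a browser. urllib2 only fetches the initial "static" page; it cannot execute JavaScript the way a real browser does. As a result, you will always get the "View Page Source" content.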
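To make that concrete, here is a minimal sketch of what a static parser sees. It uses the standard library's html.parser (the parser behind BeautifulSoup's "html.parser" backend) so that it runs without extra installs. The page content, the TagCollector class, and the element ids are made up for illustration; they are not taken from the site in the question.

```python
from html.parser import HTMLParser

# Hypothetical page: the <video> element only exists after the script runs.
STATIC_HTML = """
<html><body>
<div id="root"></div>
<script>
var v = document.createElement("video");
v.id = "player";
document.getElementById("root").appendChild(v);
</script>
</body></html>
"""


class TagCollector(HTMLParser):
    """Record every start tag and the raw text found inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies arrive as plain text; nothing is executed.
        if self._in_script:
            self.script_text.append(data)


collector = TagCollector()
collector.feed(STATIC_HTML)
collector.close()

tags = collector.tags
script_source = "".join(collector.script_text)

print(tags)                               # tags present in the static markup
print("video" in tags)                    # False: the script never ran
print("createElement" in script_source)   # True: the script is kept as text
```

The video element shows up in Chrome's Elements panel because the browser ran the script. A static parser only ever sees the script's source text.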
To solve your problem, use selenium: launch a real browser, wait for the page to load, get .page_source, and pass it to BeautifulSoup for parsing:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")
# wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fluid-width-video-wrapper")))
# get the page source
page_source = driver.page_source
driver.quit()  # end the whole browser session; close() only closes the current window
# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
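This is the general approach, but your case is a bit different: there is an iframe element that contains the video player. If you want to access the script elements inside the iframe, you need to switch to it and then get .page_source: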
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")
# wait for the page to load, switch to iframe
wait = WebDriverWait(driver, 10)
frame = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe[src*=video]")))
driver.switch_to.frame(frame)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".controls")))
# get the page source
page_source = driver.page_source
driver.quit()  # end the whole browser session; close() only closes the current window
# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
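The reason for the frame switch can be sketched without a browser: an iframe is a separate document, so the outer page's markup carries only the iframe tag and its src, not the player's own script elements. The snippet below again uses the standard library's html.parser. The outer HTML, the example.com URL, and the IframeFinder class are assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical outer page; the src URL is invented for this sketch.
OUTER_HTML = """
<html><body>
<div class="fluid-width-video-wrapper">
<iframe src="https://example.com/video/embed/123"></iframe>
</div>
<script>var outer = true;</script>
</body></html>
"""


class IframeFinder(HTMLParser):
    """Collect iframe src values and count script tags in the outer page."""

    def __init__(self):
        super().__init__()
        self.iframe_srcs = []
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.iframe_srcs.append(src)
        elif tag == "script":
            self.script_count += 1


finder = IframeFinder()
finder.feed(OUTER_HTML)
finder.close()

# Same idea as the CSS selector iframe[src*=video]: keep srcs containing "video".
video_frames = [src for src in finder.iframe_srcs if "video" in src]

print(video_frames)         # the only trace of the player in the outer page
print(finder.script_count)  # counts outer-page scripts only
```

Only the outer page's single script is counted here. The player's scripts live in the document behind the src URL, which is why the Selenium code switches into the frame before reading .page_source.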