MIT*_*THU 7 python selenium web-scraping python-3.x selenium-webdriver
I'm trying to extract the tabular content shown on a graph in a webpage. The content of those tables is only visible when someone hovers the cursor over the graph area. One such table is this one.
The table belongs to the graph titled EPS consensus revisions : last 18 months.
So far I have tried:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://www.marketscreener.com/SUNCORP-GROUP-LTD-6491453/revisions/"
driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)
for items in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#graphRevisionBNAeec span > table tr"))):
    data = [item.text for item in items.find_elements_by_css_selector("td")]
    print(data)
driver.quit()
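When I run the script above, it fails with raise TimeoutException(message, screen, stacktrace) selenium.common.exceptions.TimeoutException: Message: and the traceback points to the for items in wait.until() line.
The output for one of the several tables should look like this: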
Period: Thursday, Aug 22, 2019
Number of upgrading estimates: 0
Number of unchanged estimates: 7
Number of Downgrading estimates: 0
High Value: 0.90 AUD
Mean Value: 0.85 AUD
Low Value: 0.77 AUD
How can I get the contents of those tables from that graph?
EDIT: I'm still expecting a solution based purely on any browser simulator.
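Querying the website's backend directly is much better than scraping the frontend with Selenium, for three important reasons:
Speed: Using the API directly is much faster and more efficient, because it fetches only the data you need, with no waiting for JavaScript to run or pixels to render, and without the overhead of running a webdriver.
Stability: Changes to the frontend are usually more frequent and harder to follow than changes to the backend. If your code depends on the website's frontend, it will probably stop working soon after they make some UI changes.
Accuracy: Sometimes the data shown in the UI is inaccurate or incomplete. On this website, for example, all numbers are rounded to two decimal places, while the backend sometimes provides more than twice that precision.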
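To make the accuracy point concrete, here is a small sketch. It takes the three price values this answer reports from the backend for Thursday, Aug 22, 2019 and formats them to two decimals, which gives the figures shown in the tooltip quoted in the question. The variable names are illustrative only.

```python
# Backend values reported in this answer for Thursday, Aug 22, 2019
backend_values = [0.9, 0.84678, 0.76628]

# The UI shows two decimal places, so the extra backend precision is lost
displayed = [f"{value:.2f}" for value in backend_values]

print(displayed)  # ['0.90', '0.85', '0.77']
```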
Here is how you can easily use the backend API:
import requests
# API url found using chrome devtools
url = 'https://www.marketscreener.com/charting/afDataFeed.php?codeZB=6491453&t=eec&sub_t=bna&iLang=2'
# We are mocking a chrome browser because the API is blocking python requests apparently
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
# Make a request to the API and parse the JSON response
data = requests.get(url, headers=headers).json()[0]
# A function to find data for a specific date
def get_vals(date):
    vals = []
    for items in data:
        for item in items:
            if item['t'] == date:
                vals.append(item['y'])
                break
    return vals
# Use the function above with the example table given in the question
print(get_vals('Thursday, Aug 22, 2019'))
Running this prints the list [0.9, 0.84678, 0.76628, 0, 7, 0], which, as you can see, is the data you wanted to extract from the table given as an example.
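If you want the output in the labeled form from the question, one option is to pair each series with a label. The sketch below is a variant of get_vals that takes the parsed data as a parameter. The label order is an assumption, inferred only by matching the printed list against the tooltip in the question (the two zero counts cannot be told apart that way), so verify it against the live response. ASSUMED_LABELS, get_labeled_vals and sample are hypothetical names, and sample is a stand-in for the API response so the snippet runs without a network call.

```python
# ASSUMPTION: the series come back in this order. It is inferred by matching
# the printed list against the tooltip; the site does not document it.
ASSUMED_LABELS = [
    "High Value",
    "Mean Value",
    "Low Value",
    "Number of upgrading estimates",
    "Number of unchanged estimates",
    "Number of Downgrading estimates",
]


def get_labeled_vals(data, date, labels=ASSUMED_LABELS):
    """Return {label: value} for the point dated `date` in each series."""
    row = {}
    for label, series in zip(labels, data):
        for point in series:
            if point["t"] == date:
                row[label] = point["y"]
                break
    return row


# Stand-in for the parsed API response: one point per series, using the
# values quoted in this answer.
day = "Thursday, Aug 22, 2019"
sample = [[{"t": day, "y": y}] for y in (0.9, 0.84678, 0.76628, 0, 7, 0)]

for label, value in get_labeled_vals(sample, day).items():
    print(f"{label}: {value}")
```

With the real response you would pass data from the script above instead of sample. A date that is not present in any series simply yields an empty dict.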