Trouble parsing tabular items from a graph located in a website

Question

MIT*_*THU 7 python selenium web-scraping python-3.x selenium-webdriver

I'm trying to extract the tabular content shown on a graph in a webpage. The content of those tables is only visible when you hover the cursor over the graph area. One such table is this one.

Webpage address

The tables are in the graph titled EPS consensus revisions : last 18 months.

This is what I have tried so far:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.marketscreener.com/SUNCORP-GROUP-LTD-6491453/revisions/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)
for items in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#graphRevisionBNAeec span > table tr"))):
    data = [item.text for item in items.find_elements_by_css_selector("td")]
    print(data)
driver.quit()

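When I run the script above, it fails with raise TimeoutException(message, screen, stacktrace) selenium.common.exceptions.TimeoutException: Message: and the traceback points to the for items in wait.until() line.

The output for one of the many tables should look like this: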

Period: Thursday, Aug 22, 2019
Number of upgrading estimates: 0
Number of unchanged estimates: 7
Number of Downgrading estimates: 0
High Value: 0.90 AUD
Mean Value: 0.85 AUD
Low Value: 0.77 AUD

How can I get the contents of those tables from that graph?

EDIT: I'm still looking for any solution based purely on a browser simulator.

Answer 1

kma*_*ork 6

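It is much better to query the website's backend directly than to scrape the frontend with Selenium, for three important reasons:

Speed: Using the API directly is much faster and more efficient, because it fetches only the data you need, with no waiting for JavaScript to run or pixels to render, and without the overhead of running a webdriver.
Stability: Changes to the frontend are usually more frequent and harder to follow than changes to the backend. If your code depends on the website's frontend, it will probably stop working soon after they make some UI change.
Accuracy: Sometimes the data shown in the UI is inaccurate or incomplete. On this website, for example, all numbers are rounded to two decimal places, while the backend sometimes provides more than twice that precision.

Here is how you can easily use the backend API: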

import requests
# API url found using chrome devtools
url = 'https://www.marketscreener.com/charting/afDataFeed.php?codeZB=6491453&t=eec&sub_t=bna&iLang=2'
# We are mocking a chrome browser because the API is blocking python requests apparently
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
# Make a request to the API and parse the JSON response
data = requests.get(url, headers=headers).json()[0]
# A function to find data for a specific date
def get_vals(date):
    vals = []
    for items in data:
        for item in items:
            if item['t'] == date:
                vals.append(item['y'])
                break
    return vals
# Use the function above with the example table given in the question
print(get_vals('Thursday, Aug 22, 2019'))

Running this outputs the list [0.9, 0.84678, 0.76628, 0, 7, 0], which, as you can see, is the data you wanted to extract from the table given as an example.
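If you want the result in the same layout as the table in the question, something like the following sketch could work. It is not part of the API: format_table and LABELS are hypothetical names, the order of the six series is only inferred from the sample values above (the two zero counts cannot be told apart from this sample alone), and the AUD currency is taken from the question's expected output rather than from the response.

```python
# Assumed order of the six series in the list returned by get_vals().
# This order is inferred from the sample values only, not documented anywhere.
LABELS = [
    "High Value",
    "Mean Value",
    "Low Value",
    "Number of upgrading estimates",
    "Number of unchanged estimates",
    "Number of Downgrading estimates",
]


def format_table(date, vals, currency="AUD"):
    """Return tooltip-style lines for one date (hypothetical helper)."""
    if len(vals) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} values, got {len(vals)}")
    labelled = dict(zip(LABELS, vals))
    lines = [f"Period: {date}"]
    # The estimate counts come first in the tooltip, shown as integers.
    for label in LABELS[3:]:
        lines.append(f"{label}: {int(labelled[label])}")
    # Prices are rounded to two decimals, as the website's UI does.
    for label in LABELS[:3]:
        lines.append(f"{label}: {labelled[label]:.2f} {currency}")
    return lines


sample = [0.9, 0.84678, 0.76628, 0, 7, 0]
print("\n".join(format_table("Thursday, Aug 22, 2019", sample)))
```

With the sample list this prints the seven lines shown in the question. In practice you would pass get_vals(date) instead of the hard-coded sample, and check the series order against a date where the upgrade and downgrade counts differ.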
