Scraping data from Highcharts using Selenium

edy*_*y13 8 python selenium highcharts

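I am trying to scrape data from a Highcharts chart. I have looked at similar questions, but I don't understand how execute_script works or how to inspect the page's JS with my browser. This is my code so far: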

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Core settings
chrome_path = r"C:\Users\X\Y\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.implicitly_wait(15)

stats_url = 'https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/'

driver.get(stats_url)
driver.find_element_by_link_text('by Source').click()
driver.find_element_by_id('custom-date-range').click()
year = driver.find_element_by_id('date-range-start')
year.click()
for i in range(5): # goes back 5 years
    year.send_keys(Keys.ARROW_DOWN)
driver.find_element_by_id('date-range-submit').click()

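I want to scrape the "Downloads" data from the chart (for many pages, not just this one). When I use the custom date range option, the CSV file that the site generates automatically does not update, so the only way seems to be scraping the data from the chart itself. How can I do this?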

Flo*_* B. 5

Mozilla provides a simple REST API for retrieving the statistics, so you don't need Selenium.

With the requests module:

import requests

url = "https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20170823-20171023.json"
data = requests.get(url).json()

To select a different range, just change the two dates in the URL.
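As a minimal sketch of changing the dates, the helper below assembles a URL with the same shape as the one above. The function name `build_stats_url` and its parameters (`addon_slug`, `metric`, `group`, `fmt`) are hypothetical names chosen for illustration; only the `downloads-day-YYYYMMDD-YYYYMMDD` pattern and the `.json`/`.csv` endings appear in this thread.

```python
from datetime import date

BASE = "https://addons.mozilla.org/en-US/firefox/addon"


def build_stats_url(addon_slug, start, end, metric="downloads", group="day", fmt="json"):
    """Build a statistics URL following the pattern of the URL shown above.

    All parameter names are illustrative assumptions, not a documented API.
    """
    return (
        f"{BASE}/{addon_slug}/statistics/"
        f"{metric}-{group}-{start:%Y%m%d}-{end:%Y%m%d}.{fmt}"
    )


# Reproduces the example URL for 2017-08-23 through 2017-10-23.
url = build_stats_url("adblock-plus", date(2017, 8, 23), date(2017, 10, 23))
print(url)
```

The resulting string can be passed to `requests.get(url).json()` exactly as in the snippet above.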

However, if you would still rather scrape the chart with Selenium:

dates = driver.execute_script("return Highcharts.charts[0].series[0].xData")
users = driver.execute_script("return Highcharts.charts[0].series[0].yData")
downloads = driver.execute_script("return Highcharts.charts[0].series[1].yData")
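The `xData` and `yData` arrays come back as plain Python lists, so they can be zipped into rows. The sketch below assumes the chart uses a datetime x-axis, in which case Highcharts stores x values as milliseconds since the Unix epoch; if this chart uses a different axis type the conversion would not apply. The helper name `pair_series` and the sample values are made up for illustration.

```python
from datetime import datetime, timezone


def pair_series(x_data, y_data):
    """Zip x values (assumed to be ms since the Unix epoch) with y values.

    Returns a list of (ISO date string, value) tuples.
    """
    return [
        (datetime.fromtimestamp(x / 1000, tz=timezone.utc).date().isoformat(), y)
        for x, y in zip(x_data, y_data)
    ]


# Placeholder values standing in for what execute_script might return;
# they are not real statistics.
sample_x = [1508630400000, 1508716800000]
sample_y = [10, 12]
rows = pair_series(sample_x, sample_y)
print(rows)
```

With a live driver, `pair_series(dates, downloads)` would pair the lists retrieved above.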


Dav*_*tti 4

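I noticed one thing.

You say:

"When I use the custom date range option, the CSV file that the site generates automatically does not update."

But that is not actually the case. The file is updated, but the maximum custom date range seems to be 1 year.

For example, if you set the range from 2013-09-23 to 2017-10-23, the generated .csv (or .json) contains at most 1 year of data (in this example, from 22/10/2016 to 21/10/2017).

You can see this more clearly if you play with the "extremes" of the range.

For example: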

https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20131023-20141023.json
  • First element: {"date": "2014-10-23", "count": 212730, "end": "2014-10-23"}
  • Last element: {"date": "2013-10-24", "count": 163094, "end": "2013-10-24"}

If you change the end date:

https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20131023-20141024.json
  • First element: {"date": "2014-10-24", "count": 215105, "end": "2014-10-24"}
  • Last element: {"date": "2013-10-25", "count": 168018, "end": "2013-10-25"}

Or, if you change the start date instead:

https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20131022-20141023.json

the result is again:

  • First element: {"date": "2014-10-23", "count": 212730, "end": "2014-10-23"}
  • Last element: {"date": "2013-10-24", "count": 163094, "end": "2013-10-24"}
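The comparison above can be repeated for any response by looking at the earliest and latest `date` values it contains. This is a small sketch that assumes the JSON is a list of objects shaped like the elements quoted above; `date_span` is a hypothetical helper name.

```python
def date_span(records):
    """Return (earliest, latest) 'date' values from a list of stat records."""
    # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
    dates = [record["date"] for record in records]
    return min(dates), max(dates)


# The two elements quoted above for the 20131023-20141023 request.
records = [
    {"date": "2014-10-23", "count": 212730, "end": "2014-10-23"},
    {"date": "2013-10-24", "count": 163094, "end": "2013-10-24"},
]
span = date_span(records)
print(span)
```

If the span returned for a multi-year request covers only about one year, the request was truncated in the way described above.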

So, to get the data for the last 5 years, you can do something like this:

import subprocess

interested_years = 5
today = "2017-10-23"
year_str, month, day = today.split("-")
date_end = year_str + month + day
url = "https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-"

# Each request returns at most one year of data, so download one year
# at a time, walking backwards from today.
for year in range(1, interested_years + 1):
    date_start = str(int(year_str) - year) + month + day
    tmp_url = url + date_start + "-" + date_end + ".csv"
    args = ["curl", "-O", tmp_url]
    print(" ".join(args))
    # curl -O saves each file under its remote name in the current directory.
    process = subprocess.Popen(args, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    # The start of this window becomes the end of the next (older) one.
    date_end = date_start
    print("-----------------------------")
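If you would rather stay in Python than shell out to curl, the same idea can be sketched with two small helpers: one that produces the one-year windows and one that merges the per-window JSON lists into a single list, dropping duplicate dates. Both function names are hypothetical, and the merge assumes each response is a list of objects with a `date` key, as in the elements quoted earlier.

```python
from datetime import date


def shift_years(d, years):
    """Return d moved back by the given number of years."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        return d.replace(year=d.year - years, day=28)


def yearly_windows(end, years):
    """Return (start, end) pairs as YYYYMMDD strings, newest window first."""
    windows = []
    for _ in range(years):
        start = shift_years(end, 1)
        windows.append((start.strftime("%Y%m%d"), end.strftime("%Y%m%d")))
        end = start
    return windows


def merge_records(chunks):
    """Merge several lists of records, keeping one record per date, oldest first."""
    by_date = {}
    for chunk in chunks:
        for record in chunk:
            by_date[record["date"]] = record
    return [by_date[key] for key in sorted(by_date)]


windows = yearly_windows(date(2017, 10, 23), 5)
for start, end in windows:
    print(start, end)
```

Each `(start, end)` pair can be substituted into the `downloads-day-<start>-<end>.json` URL, fetched with `requests.get(...).json()`, and the resulting lists passed to `merge_records`.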