que*_*ang 8 python selenium web-scraping
I am trying to post several parameters to this [url] [1] and press "submit" to download the generated CSV file.
I think it takes at least 5 steps.
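Since no one has posted a solution yet, here you go. You won't get far with requests, so Selenium is your best choice here. If you want to use the script below without any modification, check that:
- dl_dir = '/tmp' is set to a directory you actually want to use
- chromedriver is installed, or change the driver to Firefox in the code (and adapt the download directory configuration to what Firefox expects)
This is the environment it was tested with: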
$ python -V
Python 3.5.3
$ chromedriver --version
ChromeDriver 2.33.506106 (8a06c39c4582fbfbab6966dbb1c38a9173bfb1a2)
$ pip list --format=freeze | grep selenium
selenium==3.6.0
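For the Firefox route mentioned in the checklist, the download directory is set through profile preferences rather than Chrome's prefs dictionary. The following is only a sketch and was not part of the tested setup: the helper names firefox_download_prefs and make_firefox_driver are hypothetical, and 'text/csv' is an assumption about the MIME type the server sends for the file (if it differs, Firefox will still show a save dialog).

```python
def firefox_download_prefs(dl_dir, mime_types=('text/csv',)):
    """Return Firefox preferences that save matching downloads to dl_dir without a dialog.

    The MIME type list is an assumption; adjust it to what the server actually sends.
    """
    return {
        'browser.download.folderList': 2,  # 2 = use the custom directory given below
        'browser.download.dir': dl_dir,
        'browser.helperApps.neverAsk.saveToDisk': ','.join(mime_types),
    }


def make_firefox_driver(dl_dir):
    """Build a Firefox driver with the preferences above (Selenium 3 style API)."""
    # Imported here so the preference helper works even without selenium installed.
    from selenium import webdriver

    profile = webdriver.FirefoxProfile()
    for key, value in firefox_download_prefs(dl_dir).items():
        profile.set_preference(key, value)
    return webdriver.Firefox(firefox_profile=profile)
```

With this, driver = make_firefox_driver(dl_dir) would take the place of the ChromeOptions lines in the script; geckodriver has to be installed instead of chromedriver.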
I commented almost every line, so let the code do the talking:
import os
import time
from selenium import webdriver
from selenium.webdriver.common import by
from selenium.webdriver.support import ui, expected_conditions as EC


def main():
    dl_dir = '/tmp'  # temporary download dir so I don't spam the real dl dir with csv files

    # check what files are downloaded before the scraping starts (will be explained later)
    csvs_old = {file for file in os.listdir(dl_dir)
                if file.startswith('NXSA-Results-') and file.endswith('.csv')}

    # I use chrome so check if you have chromedriver installed
    # pass custom dl dir to browser instance
    chrome_options = webdriver.ChromeOptions()
    prefs = {'download.default_directory': dl_dir}
    chrome_options.add_experimental_option('prefs', prefs)
    driver = webdriver.Chrome(chrome_options=chrome_options)

    # open page
    driver.get('http://nxsa.esac.esa.int/nxsa-web/#search')

    # wait for search ui to appear (abort after 10 secs)
    # once there, unfold the filters panel
    ui.WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((by.By.XPATH, '//td[text()="Observation and Proposal filters"]'))).click()

    # toggle observation availability dropdown
    driver.find_element_by_xpath('//input[@title="Observation Availability"]/../../td[2]/div/img').click()

    # wait until the dropdown elements are available, then click "proprietary"
    ui.WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((by.By.XPATH, '//div[text()="Proprietary" and @class="gwt-Label"]'))).click()

    # unfold display options panel
    driver.find_element_by_xpath('//td[text()="Display options"]').click()

    # deselect "pointed observations"
    driver.find_element_by_id('gwt-uid-241').click()

    # select "epic exposures"
    driver.find_element_by_id('gwt-uid-240').click()

    # uncomment if you want to go through the activated settings and verify them
    # when commented, the form is submitted immediately
    #time.sleep(5)

    # submit the form
    driver.find_element_by_xpath('//button/span[text()="Submit"]/../img').click()

    # wait until the results table has at least one row
    ui.WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((by.By.XPATH, '//tr[@class="MPI"]')))

    # click on save
    driver.find_element_by_xpath('//span[text()="Save table as"]').click()

    # wait for dropdown with "CSV" entry to appear
    el = ui.WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((by.By.XPATH, '//a[@title="Save as CSV, Comma Separated Values"]')))

    # somehow, the clickability does not suffice - selenium still whines about the wrong element being clicked
    # as a dirty workaround, wait a fixed amount of time to let js finish ui update
    time.sleep(1)

    # click on "CSV" entry
    el.click()

    # now. selenium can't tell whether the file is being downloaded
    # we have to do it ourselves
    # this is a quick-and-dirty check that waits until a new csv file appears in the dl dir
    # replace with watchdogs or whatever
    dl_max_wait_time = 10  # secs
    seconds = 0
    while seconds < dl_max_wait_time:
        time.sleep(1)
        csvs_new = {file for file in os.listdir(dl_dir)
                    if file.startswith('NXSA-Results-') and file.endswith('.csv')}
        if csvs_new - csvs_old:  # new file found in dl dir
            print('Downloaded file should be one of {}'.format(
                [os.path.join(dl_dir, file) for file in csvs_new - csvs_old]))
            break
        seconds += 1

    # we're done, so close the browser and shut down the chromedriver process
    driver.quit()


# script entry point
if __name__ == '__main__':
    main()
If everything works, the script should output:
Downloaded file should be one of ['/tmp/NXSA-Results-1509061710475.csv']
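The quick-and-dirty download check at the end of the script can be pulled out into a small reusable helper. The version below is a sketch under a few assumptions: the name wait_for_new_file and its parameters are made up for illustration, and it relies on the browser writing an in-progress download under a temporary name that does not end in '.csv', so only a finished file matches.

```python
import os
import time


def wait_for_new_file(dl_dir, known, prefix='NXSA-Results-', suffix='.csv',
                      timeout=10.0, poll=0.5):
    """Poll dl_dir until a matching file that is not in `known` appears.

    Returns the full path of the new file, or None if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = {f for f in os.listdir(dl_dir)
                   if f.startswith(prefix) and f.endswith(suffix)}
        new = sorted(current - set(known))
        if new:
            # several new files are possible; return the first in sorted order
            return os.path.join(dl_dir, new[0])
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

In the script this would be called as wait_for_new_file(dl_dir, csvs_old) right after el.click(); a None result means no new CSV showed up within the timeout.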