Gaa*_*ara 8 python selenium web-crawler web-scraping selenium-webdriver
我是selenium的新手,我正在编写一个刮刀,可以从给定的站点自动下载pdf文件.
以下是我的代码:
from selenium import webdriver
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2);
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
browser = webdriver.Firefox(firefox_profile=fp)
browser.get("http://epaper.dinamalar.com/PUBLICATIONS/DM/MADHURAI/2015/05/26/PagePrint//26_05_2015_001_b2b69fda315301809dda359a6d3d9689.pdf");
webobj = browser.find_element_by_id("download").click();
Run Code Online (Sandbox Code Playgroud)
我按照Selenium 文档和此链接中提到的步骤进行操作.我不确定为什么每次都会显示下载对话框.
有没有办法解决它,否则有一种方法可以提供"应用程序/所有",以便可以下载所有文件(解决方法)?
禁用内置pdfjs
插件并导航到URL - PDF文件将自动下载,代码:
from selenium import webdriver
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf")
fp.set_preference("pdfjs.disabled", "true") # < KEY PART HERE
browser = webdriver.Firefox(firefox_profile=fp)
browser.get("http://epaper.dinamalar.com/PUBLICATIONS/DM/MADHURAI/2015/05/26/PagePrint//26_05_2015_001_b2b69fda315301809dda359a6d3d9689.pdf");
Run Code Online (Sandbox Code Playgroud)
更新(对我有用的完整代码):
from selenium import webdriver
mime_types = "application/pdf,application/vnd.adobe.xfdf,application/vnd.fdf,application/vnd.adobe.xdp+xml"
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", "/home/aafanasiev/Downloads")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", mime_types)
fp.set_preference("plugin.disable_full_page_plugin_for_types", mime_types)
fp.set_preference("pdfjs.disabled", True)
browser = webdriver.Firefox(firefox_profile=fp)
browser.get("http://epaper.dinamalar.com/")
webobj_get_link = browser.find_element_by_id("liSavePdf")
webobj_get_object = webobj_get_link.find_element_by_tag_name("a")
webobj_get_object.click()
Run Code Online (Sandbox Code Playgroud)
归档时间: |
|
查看次数: |
9978 次 |
最近记录: |