rob*_*txt 9 python selenium web-scraping python-3.x selenium-webdriver
我在python中编写了一个与selenium结合使用的脚本,从网页上下载了一些文档文件(以.doc结尾).我不想使用requests或urllib模块下载文件的原因是因为我正在使用的网站没有连接到每个文件的任何真实URL.它们是javascript加密的.但是,我在我的脚本中选择了一个模仿相同的链接.
我的脚本此刻做了什么:
(this is what I need rectified)如何修改我的脚本以下载文件,启动单击其链接并将下载的文件放入相关文件夹中?
到目前为止,这是我的尝试:
import os
import time
from selenium import webdriver
link ='https://www.online-convert.com/file-format/doc'
dirf = os.path.expanduser('~')
desk_location = dirf + r'\Desktop\file_folder'
if not os.path.exists(desk_location):os.mkdir(desk_location)
def download_files():
driver.get(link)
for item in driver.find_elements_by_css_selector("a[href$='.doc']")[:2]:
filename = item.get_attribute("href").split("/")[-1]
#creating new folder in accordance with filename to store the downloaded file in thier concerning folder
folder_name = item.get_attribute("href").split("/")[-1].split(".")[0]
#set the new location of the folders to be created
new_location = os.path.join(desk_location,folder_name)
if not os.path.exists(new_location):os.mkdir(new_location)
#set the location of the folders the downloaded files will be within
file_location = os.path.join(new_location,filename)
item.click()
time_to_wait = 10
time_counter = 0
try:
while not os.path.exists(file_location):
time.sleep(1)
time_counter += 1
if time_counter > time_to_wait:break
except Exception:pass
if __name__ == '__main__':
chromeOptions = webdriver.ChromeOptions()
prefs = {'download.default_directory' : desk_location,
'profile.default_content_setting_values.automatic_downloads': 1
}
chromeOptions.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=chromeOptions)
download_files()
Run Code Online (Sandbox Code Playgroud)
下图显示了当前存储下载文件的方式(the files are outside of their concerning folders):
我刚刚添加了文件的重命名来移动它.所以它就像你拥有它一样工作,但是一旦它下载文件,就会将它移动到正确的路径:
os.rename(desk_location + '\\' + filename, file_location)
完整代码:
import os
import time
from selenium import webdriver
link ='https://www.online-convert.com/file-format/doc'
dirf = os.path.expanduser('~')
desk_location = dirf + r'\Desktop\file_folder'
if not os.path.exists(desk_location):
os.mkdir(desk_location)
def download_files():
driver.get(link)
for item in driver.find_elements_by_css_selector("a[href$='.doc']")[:2]:
filename = item.get_attribute("href").split("/")[-1]
#creating new folder in accordance with filename to store the downloaded file in thier concerning folder
folder_name = item.get_attribute("href").split("/")[-1].split(".")[0]
#set the new location of the folders to be created
new_location = os.path.join(desk_location,folder_name)
if not os.path.exists(new_location):
os.mkdir(new_location)
#set the location of the folders the downloaded files will be within
file_location = os.path.join(new_location,filename)
item.click()
time_to_wait = 10
time_counter = 0
try:
while not os.path.exists(file_location):
time.sleep(1)
time_counter += 1
if time_counter > time_to_wait:break
os.rename(desk_location + '\\' + filename, file_location)
except Exception:pass
if __name__ == '__main__':
chromeOptions = webdriver.ChromeOptions()
prefs = {'download.default_directory' : desk_location,
'profile.default_content_setting_values.automatic_downloads': 1
}
chromeOptions.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=chromeOptions)
download_files()
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
358 次 |
| 最近记录: |