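I have to download a lot of documents from a web page. They are WMV files, PDFs, BMPs, etc. Of course, they all have links. So each time I have to right-click a file, select 'Save Link As', and save it as type 'All Files'. Is it possible to do this in Python? I searched SO and found answers on how to get the links from a web page, but I want to download the actual files. Thanks in advance. (This is not a homework question :)).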
rob*_*ing 25
Here is an example of how to download some selected files from http://pypi.python.org/pypi/xlwt
You will need to install mechanize first: http://wwwsearch.sourceforge.net/mechanize/download.html
import mechanize
from time import sleep

# Make a Browser (think of this as Chrome or Firefox etc.)
br = mechanize.Browser()

# Visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
# for more ways to set up your br browser object, e.g. so it looks like Mozilla,
# and if you need to fill out forms with passwords.

# Open your site
br.open('http://pypi.python.org/pypi/xlwt')

# Save the page source; can be helpful for debugging
with open("source.html", "wb") as f:
    f.write(br.response().read())

# You will need to do some kind of pattern matching on your files
filetypes = [".zip", ".exe", ".tar.gz"]

myfiles = []
for l in br.links():  # you can also iterate through br.forms() to print forms on the page!
    # Check if this link has a file extension we want
    # (you may choose to use regular expressions or something)
    if any(t in str(l) for t in filetypes):
        myfiles.append(l)


def downloadlink(l):
    # click_link() only builds a Request; br.open() actually fetches it
    br.open(br.click_link(l))
    # Perhaps you should open in a better way & ensure that the file doesn't already exist.
    with open(l.text, "wb") as f:
        f.write(br.response().read())
    print("%s has been downloaded" % l.text)
    br.back()  # return to the HTML page so the next link can be clicked


for l in myfiles:
    sleep(1)  # throttle so you don't hammer the site
    downloadlink(l)
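Note: br.click_link(l) only returns a Request object and does not fetch anything, whereas br.follow_link(l) opens the link directly. To actually download the file, either open the request yourself with br.open(br.click_link(l)) or call br.follow_link(l) instead. See "Mechanize difference between br.click_link() and br.follow_link()".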
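The extension matching and the "ensure that file doesn't already exist" remark in the example can be sketched without mechanize. The following is a minimal illustration only, assuming Python 3 and the standard library; the names `LinkCollector`, `wanted_links`, `local_name` and the sample HTML are hypothetical and not part of the answer above. It filters on the URL path with a regular expression, so query strings do not confuse the match.

```python
import os
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Extensions to keep (assumption: the file types mentioned in the question).
WANTED = re.compile(r"\.(wmv|pdf|bmp)$", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(urljoin(self.base_url, href))


def wanted_links(html, base_url):
    """Return the links whose URL path ends in one of the wanted extensions."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.urls if WANTED.search(urlparse(u).path)]


def local_name(url, existing):
    """Derive a file name from the URL; add a counter if the name is taken."""
    name = os.path.basename(urlparse(url).path)
    stem, ext = os.path.splitext(name)
    candidate, n = name, 1
    while candidate in existing:
        candidate = "%s_%d%s" % (stem, n, ext)
        n += 1
    return candidate


# Hypothetical page fragment, used only to exercise the functions.
sample = '<a href="/files/a.PDF">a</a> <a href="b.html">b</a> <a href="clip.wmv?x=1">c</a>'
links = wanted_links(sample, "http://www.example.com/docs/")
```

Each URL in `links` could then be fetched with whatever downloader you prefer, passing the set of names already on disk to `local_name` so that nothing is overwritten.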
Use Wget with the --level, --recursive and --accept command-line options. For example:
wget --accept wmv,doc --level 2 --recursive http://www.example.com/files/