How to download a file with Python in a "smarter" way?

ken*_*der 67 python http download

I need to download several files over HTTP in Python.

The most obvious way to do it is with urllib2:

import urllib2
u = urllib2.urlopen('http://server.com/file.html')
localFile = open('file.html', 'wb')  # binary mode, so non-text downloads are not corrupted
localFile.write(u.read())
localFile.close()

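But I also have to deal with nasty URLs such as http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf. When downloaded through a browser, the file gets a human-readable name, e.g. accounts.pdf.

Is there a way to handle this in Python, so that I don't need to know the file names and hard-code them into my script?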

Oli*_*Oli 41

Download scripts like that tend to send a header telling the user agent what to name the file:

Content-Disposition: attachment; filename="the filename.ext"

If you can grab that header, you can get the proper file name.

There is another thread with a bit of code for grabbing Content-Disposition:

import urllib2
remotefile = urllib2.urlopen('http://example.com/somefile.zip')
# Raises KeyError if the server did not send this header
remotefile.info()['Content-Disposition']
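For readers on Python 3, where urllib2 no longer exists, here is a minimal sketch of pulling the file name out of a Content-Disposition value with the standard library's `email.message.Message`. The helper name `filename_from_content_disposition` is made up for illustration and is not part of the answer above.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None.

    Hypothetical helper: it lets the email package do the parameter
    parsing and unquoting instead of splitting the string by hand.
    """
    msg = Message()
    msg['Content-Disposition'] = header_value
    # get_filename() returns None when there is no filename parameter.
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="the filename.ext"'))
print(filename_from_content_disposition('attachment'))
```

In Python 3 the `headers` attribute of a `urllib.request.urlopen()` response is itself a `Message` subclass, so `response.headers.get_filename()` should work directly on a live response.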

  • No, they might redirect to a plain file. But if it's like most download scripts, it will send Content-Disposition. Definitely check. (5 upvotes)

ken*_*der 35

Based on the comments and @Oli's answer, I came up with this solution:

import urllib2
from os.path import basename
from urlparse import urlsplit

def url2name(url):
    return basename(urlsplit(url)[2])

def download(url, localFileName = None):
    localName = url2name(url)
    req = urllib2.Request(url)
    r = urllib2.urlopen(req)
    disposition = r.info().getheader('Content-Disposition', '')
    if 'filename=' in disposition:
        # If the response has Content-Disposition, we take the file name from it,
        # dropping any trailing parameters and surrounding quotes
        localName = disposition.split('filename=')[1].split(';')[0].strip().strip('"\'')
    elif r.url != url: 
        # if we were redirected, the real file name we take from the final URL
        localName = url2name(r.url)
    if localFileName: 
        # we can force to save the file as specified name
        localName = localFileName
    f = open(localName, 'wb')
    f.write(r.read())
    f.close()

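It takes the file name from Content-Disposition; if that header is absent, it uses the file name from the URL (and if a redirect happened, the final URL is the one used).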

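  • I found this useful. But to download larger files without holding their entire contents in memory, I had to look up how to copy your 'r' to 'f': `import shutil; shutil.copyfileobj(r, f)` (9 upvotes)
  • Works well, but I would wrap `urlsplit(url)[2]` in a call to `urllib.unquote`, otherwise the file names stay percent-encoded. Here is what I'm doing: `return basename(urllib.unquote(urlsplit(url)[2]))` (4 upvotes)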
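The two comments above can be sketched together as follows. This is a Python 3 rendering (an assumption, since the answer targets Python 2), and `save_stream` is a hypothetical helper name; `url2name` keeps the name used in the answer.

```python
import shutil
from os.path import basename
from urllib.parse import unquote, urlsplit


def url2name(url):
    # Take the last path segment and decode percent-escapes such as %20.
    return basename(unquote(urlsplit(url).path))


def save_stream(response, local_name):
    # Hypothetical helper: copy in chunks so a large download is never
    # held in memory all at once.
    with open(local_name, 'wb') as f:
        shutil.copyfileobj(response, f)


print(url2name('http://server.com/files/my%20report.pdf?id=1'))
```

`save_stream` accepts any file-like object opened for binary reading, which includes the object returned by `urllib.request.urlopen()`.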

los*_*gic 23

Combining much of the above, here is a more pythonic solution:

import urllib2
import shutil
import urlparse
import os

def download(url, fileName=None):
    def getFileName(url,openUrl):
        if 'Content-Disposition' in openUrl.info():
            # If the response has Content-Disposition, try to get filename from it
            cd = dict(map(
                lambda x: x.strip().split('=', 1) if '=' in x else (x.strip(),''),
                openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        # if no filename was found above, parse it out of the final URL.
        return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

    r = urllib2.urlopen(urllib2.Request(url))
    try:
        fileName = fileName or getFileName(url,r)
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r,f)
    finally:
        r.close()
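The code above is Python 2. A rough Python 3 sketch of the same idea might look like the following. It is not the answerer's code: `get_file_name` and the `directory` parameter are assumptions added for illustration, and it relies on the response `headers` object providing `get_filename()`.

```python
import os
import shutil
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def get_file_name(response):
    # get_filename() reads the filename parameter of Content-Disposition
    # and returns None when the header or the parameter is missing.
    name = response.headers.get_filename()
    if not name:
        # Fall back to the path of the final (post-redirect) URL.
        name = unquote(urlsplit(response.url).path)
    # Keep only the base name so a server-supplied path cannot point
    # outside the target directory.
    return os.path.basename(name)


def download(url, file_name=None, directory='.'):
    with urlopen(url) as response:
        file_name = file_name or get_file_name(response)
        if not file_name:
            raise ValueError('could not determine a file name for %r' % url)
        target = os.path.join(directory, file_name)
        with open(target, 'wb') as f:
            # Stream to disk instead of reading the whole body into memory.
            shutil.copyfileobj(response, f)
    return target
```

Calling `download(url)` saves into the current directory and returns the path written. Passing `file_name` forces a specific name, as in the answers above.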