Python seek on remote file using HTTP

Mar*_*oni 11 python http seek

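How do I seek to a particular position in a remote (HTTP) file so that I download only that part?

Let's say the bytes of the remote file are: 1234567890

I want to seek to 4 and download 3 bytes from there, so I would get: 456

Also, how do I check whether a remote file exists? I tried os.path.isfile(), but it returns False when I pass it a remote file URL.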

jbo*_*chi 16

If you are downloading the remote file over HTTP, you need to set the Range header.

See this example for how it can be done. It looks like this:

myUrlclass.addheader("Range","bytes=%s-" % (existSize))
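To make the arithmetic for the question's example concrete: the digit 4 in 1234567890 sits at zero-based offset 3, and HTTP byte ranges are inclusive at both ends, so "3 bytes starting at 4" becomes bytes=3-5. The following is a minimal sketch for Python 3's urllib.request, not the code from the linked example; the function names byte_range_header, build_range_request and fetch_range are made up for illustration.

```python
import urllib.request


def byte_range_header(offset, length):
    """Return a Range header value for `length` bytes from zero-based `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    return "bytes=%d-%d" % (offset, offset + length - 1)


def build_range_request(url, offset, length):
    """Build (but do not send) a GET request for one slice of a remote file."""
    headers = {"Range": byte_range_header(offset, length)}
    return urllib.request.Request(url, headers=headers)


def fetch_range(url, offset, length, timeout=10):
    """Send the request and return whatever body the server answers with."""
    request = build_range_request(url, offset, length)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Note that fetch_range returns the whole file if the server ignores the Range header, so the caller should not assume the result is exactly `length` bytes long.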

EDIT: I just found a better implementation. This class is very simple to use, as can be seen in its docstring.

import urllib
import urllib2


# RangeError is not shown in this excerpt; it is assumed here to be a
# plain IOError subclass.
class RangeError(IOError):
    """Error raised when an unsatisfiable range is requested."""


class HTTPRangeHandler(urllib2.BaseHandler):
    """Handler that enables HTTP Range headers.

    This was extremely simple. The Range header is a HTTP feature to
    begin with so all this class does is tell urllib2 that the
    "206 Partial Content" response from the HTTP server is what we
    expected.

    Example:
        import urllib2
        import byterange

        range_handler = byterange.HTTPRangeHandler()
        opener = urllib2.build_opener(range_handler)

        # install it
        urllib2.install_opener(opener)

        # create Request and set Range header
        req = urllib2.Request('http://www.python.org/')
        req.add_header('Range', 'bytes=30-50')
        f = urllib2.urlopen(req)
    """

    def http_error_206(self, req, fp, code, msg, hdrs):
        # 206 Partial Content Response: wrap it as a normal response
        # instead of letting urllib2 treat it as an error.
        r = urllib.addinfourl(fp, hdrs, req.get_full_url())
        r.code = code
        r.msg = msg
        return r

    def http_error_416(self, req, fp, code, msg, hdrs):
        # HTTP's Range Not Satisfiable error
        raise RangeError('Requested Range Not Satisfiable')
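The handler above covers the two status codes specific to range requests: 206 (the server returned the requested slice) and 416 (the range lies outside the file). A third case is worth handling: a server that ignores Range replies with 200 and the full body. In Python 3, urllib.request treats every 2xx status as success, so a custom handler for 206 should generally not be needed there. Below is a minimal Python 3 sketch of how a caller might interpret those three outcomes once it has a status code and a body, however they were obtained; slice_from_response and RangeNotSatisfiable are hypothetical names, not part of urllib or urlgrabber.

```python
class RangeNotSatisfiable(Exception):
    """Hypothetical error type for a 416 response."""


def slice_from_response(status, body, offset, length):
    """Return the requested bytes given the HTTP status and response body."""
    if status == 206:
        # The server honoured the Range header: the body is already the slice.
        return body
    if status == 200:
        # The server ignored Range and sent the whole file: slice it locally.
        return body[offset:offset + length]
    if status == 416:
        # The requested range starts beyond the end of the file.
        raise RangeNotSatisfiable("Requested Range Not Satisfiable")
    raise ValueError("unexpected HTTP status: %d" % status)
```

With the question's data, slice_from_response(200, b"1234567890", 3, 3) and slice_from_response(206, b"456", 3, 3) both give b"456".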

Update: The "better implementation" has moved to GitHub: excid3/urlgrabber, in the byterange.py file.


nul*_*atz 5

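I highly recommend using the requests library. It is easily the best HTTP library I have ever used. In particular, to accomplish what you have described, you would do something like: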

import requests

url = "http://www.sffaudio.com/podcasts/ShellGameByPhilipK.Dick.pdf"

# Retrieve bytes between offsets 3 and 5 (inclusive).
r = requests.get(url, headers={"range": "bytes=3-5"})

# If a 4XX client error or a 5XX server error is encountered, we raise it.
r.raise_for_status()
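On the second part of the question: os.path.isfile() only looks at the local filesystem, so it returns False for any URL. One common approach is to send a HEAD request and look at the status code. The following is a minimal sketch using only the Python 3 standard library; remote_file_exists and its open_url parameter are hypothetical names, and treating only 404 and 410 as "does not exist" is an assumption, since a 403 or 405 does not say either way and some servers do not support HEAD at all.

```python
import urllib.error
import urllib.request


def remote_file_exists(url, open_url=urllib.request.urlopen, timeout=10):
    """Return True if a HEAD request for `url` succeeds, False on 404/410.

    `open_url` is injectable so the function can be exercised without a
    network connection; by default it is urllib.request.urlopen.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with open_url(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code in (404, 410):
            return False
        # Other errors (403, 405, 5xx, ...) do not tell us whether the
        # file exists, so they are left for the caller to handle.
        raise
```

The same check with requests would use requests.head(url) and inspect the status code of the returned response.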