TTT*_*TTT 13 python networking download python-requests
I'm using the Python Requests library to download a large file, e.g.:
r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content
The large file downloads at around 30 KB per second, which is a bit slow. Each connection to the bigfile server is throttled, so I'd like to open multiple connections.
Is there a way to make several connections at the same time to download one file?
Vyk*_*tor 21
You can use the HTTP Range header to fetch just a part of the file (this is already covered for Python here).
Just start several threads and fetch a different range with each, and you're done ;)
import threading
import urllib2

url = 'http://bigfile.com/bigfile.bin'
chunk_size = 1024 * 1024  # bytes requested per connection

def download(url, start):
    req = urllib2.Request(url)
    # Range is inclusive on both ends, hence the -1
    req.headers['Range'] = 'bytes=%s-%s' % (start, start + chunk_size - 1)
    f = urllib2.urlopen(req)
    parts[start] = f.read()

threads = []
parts = {}

# Initialize threads
for i in range(10):
    t = threading.Thread(target=download, args=(url, i * chunk_size))
    t.start()
    threads.append(t)

# Join threads back (order doesn't matter, you just want them all)
for t in threads:
    t.join()

# Sort parts and you're done
result = ''.join(parts[i] for i in sorted(parts.keys()))
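For reference, here is a Python 3 sketch of the same threaded Range idea using only the standard library. The helper names, the chunk size, and the assumption that the total file size is known up front are mine, not part of the original answer:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

def byte_ranges(size, chunk_size):
    """Split [0, size) into inclusive (start, end) pairs, as Range headers expect."""
    return [(start, min(start + chunk_size, size) - 1)
            for start in range(0, size, chunk_size)]

def fetch_range(url, start, end):
    """Fetch one slice of the file with an HTTP Range request."""
    req = Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urlopen(req) as resp:
        return resp.read()

def download(url, size, workers=4, chunk_size=1 << 20):
    """Download `size` bytes from `url` over several parallel connections."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: fetch_range(url, *r), byte_ranges(size, chunk_size))
    return b"".join(parts)  # map() yields results in submission order
```

`ThreadPoolExecutor.map` preserves input order, so no sorting step is needed when joining the parts.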
Note also that not every server supports the Range header (in particular, servers where a PHP script is responsible for serving the data often don't implement it).
Here is a Python script that saves a given URL to a file and uses multiple threads to download it:
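One way to find out beforehand is a HEAD request that inspects the Accept-Ranges and Content-Length headers. This is only a sketch: advertising `Accept-Ranges: bytes` is a convention, and a server may still ignore Range requests in practice.

```python
from urllib.request import Request, urlopen

def supports_ranges(headers):
    """True if the response headers advertise byte-range support."""
    return headers.get("Accept-Ranges", "").lower() == "bytes"

def probe(url):
    """HEAD the URL and report (range_support, content_length)."""
    with urlopen(Request(url, method="HEAD")) as resp:
        return supports_ranges(resp.headers), int(resp.headers.get("Content-Length", 0))
```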
#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool  # use threads
from urllib2 import HTTPError, Request, urlopen

def download_chunk(url, byterange):
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        return urlopen(req).read()
    except HTTPError as e:
        return b'' if e.code == 416 else None  # treat range error as EOF
    except EnvironmentError:
        return None

def main():
    url, filename = sys.argv[1:]
    pool = Pool(4)  # define number of concurrent connections
    chunksize = 1 << 16
    ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
    with open(filename, 'wb') as file:
        for s in pool.imap(partial(download_chunk, url), ranges):
            if not s:
                break  # error or EOF
            file.write(s)
            if len(s) != chunksize:
                break  # EOF (servers with no Range support end up here)

if __name__ == "__main__":
    main()
End of file is detected when the server returns an empty body or a 416 HTTP status, or when the response size is not exactly chunksize.
It also supports servers that don't understand the Range header (in that case everything is downloaded in a single request; to support huge files, change download_chunk() to save to a temporary file and return the filename to be read in the main thread, instead of the file content itself).
It lets you change the number of concurrent connections (the pool size) and the number of bytes requested per HTTP request independently.
To use multiple processes instead of threads, change the import:
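That large-file variant might look roughly like this (a Python 3 sketch; the function name and the use of NamedTemporaryFile are my assumptions, not the answer's code):

```python
import shutil
import tempfile
from urllib.request import Request, urlopen

def download_chunk_to_file(url, byterange):
    """Stream one byte range into a temp file and return its path,
    so chunk contents never have to sit in memory all at once."""
    req = Request(url, headers={"Range": "bytes=%d-%d" % byterange})
    with urlopen(req) as resp, tempfile.NamedTemporaryFile(delete=False) as tmp:
        shutil.copyfileobj(resp, tmp)
        return tmp.name
```

The main thread would then read each returned file, append it to the output, and delete it.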
from multiprocessing.pool import Pool # use processes (other code unchanged)