Hello, I have been using this snippet to download files from a website. Files smaller than 1 GB have all been fine so far, but I noticed that a 1.5 GB file comes out incomplete:
import sys
import time

# s is a requests session object
r = s.get(fileUrl, headers=headers, stream=True)
start_time = time.time()
with open(local_filename, 'wb') as f:
    count = 1
    block_size = 512
    try:
        total_size = int(r.headers.get('content-length'))
        print 'file total size :', total_size
    except TypeError:
        print 'using dummy length !!!'
        total_size = 10000000
    for chunk in r.iter_content(chunk_size=block_size):
        if chunk:  # filter out keep-alive new chunks
            duration = time.time() - start_time
            progress_size = int(count * block_size)
            if duration == 0:
                duration = 0.1
            speed = int(progress_size / (1024 * duration))
            percent = int(count * block_size * 100 / total_size)
            sys.stdout.write("\r...%d%%, %d MB, %d KB/s, %d seconds passed" %
                             (percent, progress_size / (1024 * 1024), speed, duration))
            f.write(chunk)
            f.flush()
            count += 1
With the latest requests (2.2.1) on Python 2.6.6, CentOS 6.4, the download always stops at 66.7%, i.e. 1024 MB. What am I missing? Output:
file total size : 1581244542
...67%, 1024 MB, 5687 KB/s, 184 seconds passed
It seems the generator returned by iter_content() thinks all chunks have been retrieved, and no error is raised. By the way, the except branch never runs, because the server does return Content-Length in the response headers.
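One way to narrow this down is to bypass iter_content() entirely and count the raw bytes coming off the socket, to see whether the shortfall happens at the network level or inside the generator. This is a minimal diagnostic sketch, reusing s, fileUrl and headers from the snippet above:

# Diagnostic sketch: read straight from the underlying urllib3 response
# (r.raw) instead of iter_content(), and compare the byte count against
# the advertised Content-Length.
r = s.get(fileUrl, headers=headers, stream=True)
expected = int(r.headers['content-length'])
received = 0
while True:
    data = r.raw.read(8192)
    if not data:
        break
    received += len(data)
print 'expected %d bytes, received %d bytes' % (expected, received)

If received matches expected here, the bytes are arriving and the problem is in the chunk loop; if it falls short at the same point, the server or connection is cutting the transfer off.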
Please double-check that you can download the file via wget and/or any regular browser. It could be a restriction on the server side. As far as I can tell, your code is capable of downloading big files (bigger than 1.5 GB).
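As a rough stand-in for the wget test, a HEAD request lets you inspect what the server advertises without downloading anything. A small sketch, using the same fileUrl and headers as the question:

# Check the advertised length and whether the server supports ranges.
h = s.head(fileUrl, headers=headers, allow_redirects=True)
print h.status_code
print h.headers.get('content-length')
print h.headers.get('accept-ranges')  # 'bytes' means range requests work

If accept-ranges reports 'bytes', you could also resume a truncated download with a Range header rather than restarting from zero.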
Update: please try inverting the logic. Instead of
if chunk:  # filter out keep-alive new chunks
    f.write(chunk)
    f.flush()
try
if not chunk:
    break
f.write(chunk)
f.flush()
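For completeness, here is how that inverted check fits into the full download loop. This is a sketch of the suggestion above, not a verified fix; r, local_filename and block_size are the question's own variables:

with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=block_size):
        if not chunk:  # stop as soon as an empty chunk appears
            break
        f.write(chunk)
        f.flush()

The difference is that an empty chunk now ends the loop explicitly instead of being silently skipped, which should at least make it visible whether the generator is handing back empty chunks before the file is complete.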