src*_*src 3 python wget python-requests
I am seeing a strange error. There is a file on Dropbox, and I am downloading it with the following Python code:
import requests
import shutil
url = 'https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0'
r = requests.get(url, stream=True)
path_to_save = "/tmp/data.dload-1"
with open(path_to_save, 'wb') as f:
    shutil.copyfileobj(r.raw, f)
This downloads to /tmp/data.dload-1.
The same file, downloaded with wget:
wget https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0 -O /tmp/data.dload-2
Both files have the same type:
(dl)x:x$ file /tmp/data.dload-1
/tmp/data.dload-1: gzip compressed data, from Unix
(dl)x:x$ file /tmp/data.dload-2
/tmp/data.dload-2: gzip compressed data, last modified: Thu Apr 26 23:05:15 2018, from Unix
But extracting them gives different results:
(dl)x:x$ tar -zxvf /tmp/data.dload-1
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
(dl) x:x$ tar -zxvf /tmp/data.dload-2
testfiles/a
testfiles/b
(dl)x:x$
Does anyone know why this happens and, more importantly, how I can download that tar file with Python (preferably with requests)?
Here is the output of r.headers:
(dl) x:x$ python dload-test.py
{'Server': 'nginx', 'Date': 'Fri, 27 Apr 2018 17:27:06 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'no-cache', 'Content-Security-Policy': "script-src 'unsafe-eval' https://www.dropbox.com/static/compiled/js/ https://www.dropbox.com/static/javascript/ https://www.dropbox.com/static/api/ https://cfl.dropboxstatic.com/static/compiled/js/ https://www.dropboxstatic.com/static/compiled/js/ https://cfl.dropboxstatic.com/static/js/ https://www.dropboxstatic.com/static/js/ https://cfl.dropboxstatic.com/static/previews/ https://www.dropboxstatic.com/static/previews/ https://cfl.dropboxstatic.com/static/api/ https://www.dropboxstatic.com/static/api/ https://cfl.dropboxstatic.com/static/cms/ https://www.dropboxstatic.com/static/cms/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ 'unsafe-inline' ; img-src https://* data: blob: ; frame-ancestors 'self' ; default-src 'none' ; frame-src https://* carousel://* dbapi-6://* dbapi-7://* dbapi-8://* itms-apps://* itms-appss://* ; worker-src https://www.dropbox.com/static/serviceworker/ blob: ; style-src https://* 'unsafe-inline' 'unsafe-eval' ; connect-src https://* ws://127.0.0.1:*/ws ; object-src 'self' https://cfl.dropboxstatic.com/static/ https://www.dropboxstatic.com/static/ https://flash.dropboxstatic.com https://swf.dropboxstatic.com https://dbxlocal.dropboxstatic.com ; media-src https://* blob: ; font-src https://* data: ; child-src https://www.dropbox.com/static/serviceworker/ blob: ; form-action 'self' https://www.dropbox.com/ https://dl-web.dropbox.com/ https://photos.dropbox.com/ https://accounts.google.com/ https://api.login.yahoo.com/ https://login.yahoo.com/ ; base-uri 'self' api-stream.dropbox.com showbox-tr.dropbox.com ; report-uri https://www.dropbox.com/csp_log", 'Dropbox-Streaming': 'V=1', 'Pragma': 'no-cache', 'Referrer-Policy': 'origin-when-cross-origin', 'Set-Cookie': 'locale=en; 
Domain=dropbox.com; expires=Wed, 26 Apr 2023 17:27:06 GMT; Path=/; secure, gvc=OTU0NjExNzUwNjc0NjQxNzgwMzE0OTgzMzkzNjc3MzM5OTYzNzc%3D; expires=Wed, 26 Apr 2023 17:27:06 GMT; httponly; Path=/; secure, flash=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, puc=; expires=Fri, 27 Apr 2018 17:27:06 GMT; httponly; Path=/; secure, bang=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, seen-sl-signup-modal=VHJ1ZQ%3D%3D; expires=Sun, 27 May 2018 17:27:06 GMT; httponly; Path=/; secure, t=HlsAKcFI_HJWteio0_5ELyFf; Domain=dropbox.com; expires=Mon, 26 Apr 2021 17:27:06 GMT; httponly; Path=/; secure, __Host-js_csrf=HlsAKcFI_HJWteio0_5ELyFf; expires=Mon, 26 Apr 2021 17:27:06 GMT; Path=/; secure', 'X-Content-Type-Options': 'nosniff', 'X-Dropbox-Request-Id': 'b028e94ce7b814c7f25fb753449b641a', 'X-Frame-Options': 'DENY', 'X-Robots-Tag': 'noindex, nofollow, noimageindex', 'X-Xss-Protection': '1; mode=block', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains', 'Content-Encoding': 'gzip'}
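The problem is that the file is being gzip-compressed in transit even though it is already a gzip-compressed file (as you can see from the 'Content-Encoding': 'gzip' field in r.headers).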
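You are using the default request headers, for both requests and wget. By default, they both send something like 'Accept-Encoding: gzip, deflate'. (You can see this if you print r.request.headers.) So the server is free to compress the file and send it back with a 'Content-Encoding: gzip' header.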
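As a quick check that needs no network access, the default headers that requests attaches can be inspected directly. This is a minimal sketch; the exact value depends on the installed requests version and on optional decoders such as Brotli, so only the presence of gzip is assumed here.

```python
import requests

# Headers that requests adds to every request unless they are overridden.
defaults = requests.utils.default_headers()
accept_encoding = defaults["Accept-Encoding"]

# Typically something like "gzip, deflate", possibly with extra codings.
print(accept_encoding)
```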
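Both wget and requests will, by default, detect that header and transparently decode the data for you, but by reading r.raw you have explicitly told requests not to do that and to hand you the raw data as-is.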
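One way to keep the transparent decoding while still streaming to disk is to read through iter_content() instead of r.raw. The sketch below assumes that approach; the function name download_decoded and its session and chunk_size parameters are illustrative and not part of the original code.

```python
import requests


def download_decoded(url, path, session=None, chunk_size=64 * 1024):
    """Stream url to path, letting requests undo any Content-Encoding."""
    session = session or requests.Session()
    with session.get(url, stream=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            # iter_content() yields decoded bytes, whereas r.raw yields
            # the bytes exactly as they were sent by the server.
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
```

Called as download_decoded(url, "/tmp/data.dload-1") with the url from the question, this should save the tarball itself rather than a gzip-wrapped copy of it.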
So you end up saving a gzip-compressed, gzip-compressed tarball. Obviously, file will report that as gzip compressed data, and tar -z will report that what is inside the gzip does not look like a tar archive, because it isn't one; it is a gzipped tar archive.
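The double compression can be reproduced offline with only the standard library. The sketch below builds a tiny tarball in memory (the member name testfiles/a mirrors the question; the contents are made up), gzips it a second time to imitate what Content-Encoding: gzip adds, and then tries to read each version.

```python
import gzip
import io
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def make_tar_gz():
    """Build a small .tar.gz in memory with one member, testfiles/a."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello\n"
        info = tarfile.TarInfo("testfiles/a")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_members(blob):
    """Return member names, or None if blob is not a readable .tar.gz."""
    try:
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
            return tar.getnames()
    except tarfile.TarError:
        return None


tarball = make_tar_gz()
on_the_wire = gzip.compress(tarball)  # the extra layer added in transit

# Both start with the gzip magic bytes, so a type check cannot tell them apart.
print(tarball.startswith(GZIP_MAGIC), on_the_wire.startswith(GZIP_MAGIC))

print(list_members(tarball))                       # the real tarball
print(list_members(on_the_wire))                   # what r.raw saved
print(list_members(gzip.decompress(on_the_wire)))  # one layer peeled off
```

Peeling off one gzip layer is also a way to rescue a file that was already saved in the double-compressed form.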
The smallest fix here is to manually add headers={'Accept-Encoding': 'identity'} to your request.
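Applied to the code from the question, that fix might look like the sketch below. The function name download_identity and the optional session parameter are illustrative additions; the header and the use of r.raw come from the text above.

```python
import shutil

import requests

# Ask the server not to apply any content coding to the response body.
IDENTITY = {"Accept-Encoding": "identity"}


def download_identity(url, path, session=None):
    """Save url to path byte for byte, requesting an unencoded body."""
    session = session or requests.Session()
    with session.get(url, headers=IDENTITY, stream=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            # r.raw is safe here only because no Content-Encoding is expected.
            shutil.copyfileobj(r.raw, f)
```

With the url from the question, download_identity(url, "/tmp/data.dload-1") should then produce a file that tar -zxvf can extract, assuming the server honors the header.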
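You may be wondering why the server bothers to gzip a file that is already gzipped. Just because you told it you can accept gzip doesn't mean you demanded gzip, right?
If you look at RFC 2616 and RFC 7231, the server is supposed to pick the encoding it can support that has the highest qvalue (weight) among those the client specified, breaking ties by some unspecified heuristic. If your user agent explicitly asks for 'gzip, deflate', then sending identity when one of those is actually possible would be, if not incorrect, at least a little silly.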
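The selection rule can be sketched roughly as follows. This is a simplified illustration rather than a complete implementation of the RFC grammar: the function names, the supported tuple, and the tie-break (earliest entry in supported wins) are assumptions, and an unlisted identity is treated as a last-resort fallback.

```python
def parse_accept_encoding(header):
    """Parse an Accept-Encoding value into a {coding: qvalue} dict."""
    weights = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0  # weight used when no q parameter is given
        params = params.strip()
        if params.lower().startswith("q="):
            q = float(params[2:])
        weights[coding.strip().lower()] = q
    return weights


def choose_encoding(header, supported=("gzip", "deflate", "identity")):
    """Pick the supported coding with the highest qvalue, or None."""
    weights = parse_accept_encoding(header)
    best, best_q = None, 0.0
    for coding in supported:
        # "*" covers any coding that is not listed explicitly.
        q = weights.get(coding, weights.get("*", 0.0))
        if q > best_q:  # strict comparison: earlier entries win ties
            best, best_q = coding, q
    if best is None and weights.get("identity", weights.get("*", 1.0)) > 0:
        # identity stays acceptable unless it was explicitly excluded.
        return "identity"
    return best


print(choose_encoding("gzip, deflate"))
print(choose_encoding("identity"))
```

Under these assumptions, the default 'gzip, deflate' header leads the server to gzip, while 'identity' leaves the body untouched, which matches the behavior described above.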