Shr*_*ut1 · 7 · Tags: truncated, python-3.x, python-requests
This question is different from others I have seen: requests.iter_content() (and requests itself) seems to think it has successfully reached the end of the file I am iterating over, when in reality the file is truncated and incomplete. The file I am trying to process is a 17 GB gzip that needs to be enriched and stored in a database. A browser downloads the same file just fine.

Why does requests not download this file completely, and why does it not throw an exception if it cannot download the whole file?
Source code: (update - see edit below)

Here is my "reader" function - it is part of a multiprocessing script that processes the data:
    import zlib

    import requests
    import urllib3

    def patch_urllib3():
        """Set urllib3's enforce_content_length to True by default."""
        previous_init = urllib3.HTTPResponse.__init__
        def new_init(self, *args, **kwargs):
            previous_init(self, *args, enforce_content_length=True, **kwargs)
        urllib3.HTTPResponse.__init__ = new_init

    def reader(target_url, data_queue, coordinator_queue, chunk_size):
        patch_urllib3()
        #Using zlib.MAX_WBITS|32 apparently forces zlib to detect the appropriate header for the data
        decompressor = zlib.decompressobj(zlib.MAX_WBITS|32)
        #Stream this file in as a request - pull the content in just a little at a time
        #This should remain open until completion.
        with requests.get(target_url, stream=True) as remote_file:
            last_line = ""  #start this blank
            #Chunk size can be adjusted to test performance
            for data_chunk in remote_file.iter_content(chunk_size=chunk_size):
                #Decompress the current chunk
                decompressed_chunk = decompressor.decompress(data_chunk)
                #These characters are in "byte" format and need to be decoded to utf-8
                decompressed_chunk = decompressed_chunk.decode()
                #Append the "last line" to add any fragments from the last chunk - it is blank the first time around
                #This basically sticks line fragments from the last chunk onto the front of the current chunk.
                decompressed_chunk = last_line + decompressed_chunk
                #Run a split here; this is likely a costly step...
                data_chunk = list(decompressed_chunk.splitlines())
                #Pop the last line off the chunk since it isn't likely to be complete
                #We'll add it to the front of the next chunk
                last_line = data_chunk.pop()
                data_queue.put(data_chunk)
                coordinator_queue.put('CHUNK_READ')
        #File is fully read so send the last line and let the reader exit:
        print("Sending last line.")
        data_queue.put(last_line)
        #Notify coordinator process of task completion
        coordinator_queue.put('READ_DONE')
Additional notes:

- The truncation means last_line leaves me with a fragment at the end (which then breaks my data processing function).
- I thought the with clause and the stream=True argument would help keep the session from ending early.
- There are no errors from the requests library when the request "finishes": "Sending last line." is printed per my code sample, so the with clause completes "successfully".

Update:
I found a blog post that speaks directly to this problem. By default, the requests package does not set urllib3's enforce_content_length option to true. There is no way to do this directly through requests, so urllib3 has to be "patched" before the requests object is set up. Note the def patch_urllib3() function listed in this github issue. I have updated my source code to include this function, and I now get the following errors when the read stops prematurely:
    urllib3.exceptions.IncompleteRead: IncompleteRead(52079993 bytes read, 18453799085 more expected)
    urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(52079993 bytes read, 18453799085 more expected)', IncompleteRead(52079993 bytes read, 18453799085 more expected))
    requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(52079993 bytes read, 18453799085 more expected)', IncompleteRead(52079993 bytes read, 18453799085 more expected))
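These errors confirm the truncation. An alternative way to detect it, without patching urllib3, is to count the bytes yielded by iter_content and compare the total against the Content-Length header once the loop ends. A minimal sketch (the function names are illustrative; it assumes the server sends a plain Content-Length and no Content-Encoding, which matches a .gz file served as-is):

```python
def bytes_expected(headers):
    """Return the advertised body size from a Content-Length header, or None."""
    value = headers.get("Content-Length")
    return int(value) if value is not None else None

def verify_length(bytes_received, headers):
    """Raise IOError if fewer bytes arrived than the server advertised."""
    expected = bytes_expected(headers)
    if expected is not None and bytes_received < expected:
        raise IOError(f"truncated read: got {bytes_received} of {expected} bytes")

# With requests, the check would wrap the streaming loop, roughly:
#   with requests.get(target_url, stream=True) as remote_file:
#       received = 0
#       for data_chunk in remote_file.iter_content(chunk_size=4096):
#           received += len(data_chunk)
#           ...
#       verify_length(received, remote_file.headers)
```

If the response were Content-Encoding compressed, iter_content would yield decoded bytes and this comparison would no longer be valid.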
I am still trying to find a way around these errors, or a way to resume the file download. I tried sending another request with a Range header starting where the previous download ended, but it seems the interruption of the initial request is not recoverable.
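For reference, when a server does honor Range requests, the usual resume pattern is a fresh request starting at the byte offset already received, checking the status code to see whether the Range was actually honored. A hedged sketch of just that bookkeeping (the helper names are illustrative, not an existing API):

```python
def range_header(offset):
    """Ask the server for bytes from `offset` through the end of the file."""
    return {"Range": f"bytes={offset}-"}

def resume_offset(bytes_so_far, status_code):
    """206 Partial Content means the Range was honored, so keep our progress;
    200 means the server ignored it and is resending the whole body, so the
    byte count has to start over from zero."""
    return bytes_so_far if status_code == 206 else 0

# Roughly, with requests:
#   resp = requests.get(target_url, stream=True,
#                       headers=range_header(bytes_so_far))
#   bytes_so_far = resume_offset(bytes_so_far, resp.status_code)
```

Note that even with a successful Range resume, a mid-stream gzip cannot be decompressed from scratch: the zlib decompressor state from the first attempt has to be kept alive and fed the resumed bytes.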