Rob*_*iaz 5 python zip amazon-s3 amazon-web-services
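I am trying to list all the files inside a ".zip" file without downloading the whole thing.
I have managed to do this for files smaller than 4GB with the following code: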
from io import BytesIO
from zipfile import ZipFile

def get_list_of_files_from_zip(self, source_bucket, source_key, ignore_hidden_files=True):
    # self.s3 returns boto3.resource('s3'), already initialized with the keys
    s3_object = self.s3.Object(source_bucket, source_key)
    size = s3_object.content_length

    # End of central directory record (EOCD)
    eocd = self._fetch_bytes_from_file(source_bucket, source_key, size - 22, 22)

    # start offset and size of the central directory
    cd_start = convert_to_int(eocd[16:20])
    cd_size = convert_to_int(eocd[12:16])

    # fetch central directory, append EOCD, and open as zipfile!
    cd = self._fetch_bytes_from_file(source_bucket, source_key, cd_start, cd_size)
    zip = ZipFile(BytesIO(cd + eocd))

    list_of_file = []
    for entry in zip.filelist:
        if ignore_hidden_files and (entry.file_size == 0 or is_hidden(entry.filename)):
            continue
        list_of_file.append({"name": entry.filename,
                             "size": entry.file_size})  # in bytes
    return list_of_file

def _fetch_bytes_from_file(self, source_bucket, source_key, start, len):
    """
    Range-fetch an S3 key.
    """
    end = start + len - 1
    s3_object = self.s3.Object(source_bucket, source_key).get(Range="bytes=%d-%d" % (start, end))
    return s3_object['Body'].read()

def convert_to_int(bytes):
    # Little-endian decode of a 2- or 4-byte field.
    # Python 2: indexing a str yields 1-character strings, hence ord().
    val = ord(bytes[0]) + (ord(bytes[1]) << 8)
    if len(bytes) > 3:
        val += (ord(bytes[2]) << 16) + (ord(bytes[3]) << 24)
    return val
The problem is that when I try the same thing on a 70GB file, this is what I get:
Traceback (most recent call last):
  File "/Users/.../env/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 3035, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-b4cd8dc7616e>", line 1, in <module>
    s3.get_list_of_files_from_zip(bucket_name,key_name)
  File "/Users/.../base.py", line 153, in get_list_of_files_from_zip
    zip = ZipFile(BytesIO(cd + eocd))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 839, in _RealGetContents
    raise BadZipfile("Bad magic number for central directory")
BadZipfile: Bad magic number for central directory
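After some research, I found that zip files larger than 4GB have a different structure.
According to the specification (search for "4.3.15 Zip64 end of central directory locator"), the "Zip64 end of central directory locator" should help me find the "Zip64 end of central directory record", which in turn would let me extract the start offset and size of the central directory of a zip64 file.
So what I did was: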
size_eocd = 22 # End of central directory record
size_Zip64EndCD = 20
Zip64EndCD = self._fetch_bytes_from_file(source_bucket, source_key, size - (size_eocd + size_Zip64EndCD), size_Zip64EndCD)
# relative offset of the zip64 end of central directory record 8 bytes
relative_offset = convert_to_int(Zip64EndCD[8:16])
# result in my example relative_offset = 1811690735, size = 74826134865
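This is where I get lost. The documentation says this is the "relative offset of the zip64 end of central directory record", but it does not say what the offset is relative to (the size? the position of the central directory?).
I tried the following, but did not find the "zip64 end of central directory signature" = 0x06064b50: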
"\x50\x4b\x06\x06" in self._fetch_bytes_from_file(source_bucket, source_key, size - relative_offset, 3000)
What am I doing wrong?
A while ago I wrote zipdetails to help me understand the internal structure of zip files.
Let's create a zip64 zip file (the -fz option forces Zip64).
$ zip -fz xx.zip /tmp/Makefile
$ unzip -l xx.zip
Archive:  xx.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     1240  02-05-2020 14:31   tmp/Makefile
---------                     -------
     1240                     1 file
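If you run zipdetails against a zip64 zip file and look at the end, where the central directory data lives, you will see something like this. I have added extra annotation to show the pointer you need to follow. The "relative offset of the zip64 end of central directory record" field in the locator gives the position in the file of the "Zip64 end of central directory record". In this case it is hex 299.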
0299 ZIP64 END CENTRAL DIR 06064B50          <--------+
     RECORD                                           |
029D Size of record        000000000000002C           |
02A5 Created Zip Spec      1E '3.0'                   |
02A6 Created OS            03 'Unix'                  |
02A7 Extract Zip Spec      2D '4.5'                   |
02A8 Extract OS            00 'MS-DOS'                |
02A9 Number of this disk   00000000                   |
02AD Central Dir Disk no   00000000                   |
02B1 Entries in this disk  0000000000000001           |
02B9 Total Entries         0000000000000001           |
02C1 Size of Central Dir   000000000000005E           |
02C9 Offset to Central dir 000000000000023B           |
                                                      |
02D1 ZIP64 END CENTRAL DIR 07064B50                   |
     LOCATOR                                          |
02D5 Central Dir Disk no   00000000                   |
02D9 Offset to Central dir 0000000000000299  ---------+
02E1 Total no of Disks     00000001

02E5 END CENTRAL HEADER    06054B50
02E9 Number of this disk   0000
02EB Central Dir Disk no   0000
02ED Entries in this disk  0001
02EF Total Entries         0001
02F1 Size of Central Dir   0000005E
02F5 Offset to Central Dir FFFFFFFF
02F9 Comment Length        0000
Done
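
The pointer chain in this dump can also be followed in code. The following is a minimal sketch, not part of the original answer, that decodes the locator and the Zip64 end of central directory record with the standard `struct` module. The function names are hypothetical, and it assumes a single-disk archive whose records have the fixed sizes shown in the dump (20 and 56 bytes). The sample bytes are rebuilt from the values in the dump rather than read from a real file.

```python
import struct

LOCATOR_SIZE = 20       # Zip64 end of central directory locator
ZIP64_EOCD_SIZE = 56    # fixed part of the Zip64 end of central directory record


def zip64_eocd_offset(locator):
    """Return the offset of the Zip64 EOCD record stored in the 20-byte locator."""
    # signature (4), disk number (4), offset (8), total disks (4), little-endian
    signature, _disk, offset, _total_disks = struct.unpack("<IIQI", locator)
    if signature != 0x07064B50:
        raise ValueError("Zip64 end of central directory locator not found")
    return offset


def central_directory_span(zip64_eocd):
    """Return (offset, size) of the central directory from the Zip64 EOCD record."""
    fields = struct.unpack("<IQHHIIQQQQ", zip64_eocd[:ZIP64_EOCD_SIZE])
    if fields[0] != 0x06064B50:
        raise ValueError("Zip64 end of central directory record not found")
    cd_size, cd_offset = fields[8], fields[9]
    return cd_offset, cd_size


# Rebuild the two records from the values shown in the zipdetails dump.
record = struct.pack("<IQHHIIQQQQ", 0x06064B50, 0x2C, 0x031E, 0x002D,
                     0, 0, 1, 1, 0x5E, 0x23B)
locator = struct.pack("<IIQI", 0x07064B50, 0, 0x299, 1)

print(hex(zip64_eocd_offset(locator)))
print([hex(v) for v in central_directory_span(record)])
```

Against a remote file, the locator would be the 20 bytes that end just before the final 22-byte record (starting at `size - 42` when there is no archive comment), and the Zip64 record would then be range-fetched from the offset the locator reports.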
Edit: updated the URL for zipdetails.
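
As a cross-check against the numbers quoted in the question, here is a small arithmetic sketch. It is an editorial illustration, not something the answer states. It assumes the 70GB archive ends with the same three records as the dump (56 + 20 + 22 bytes, no archive comment), which would put the Zip64 end of central directory record at `size - 98`. The value reported in the question matches the low 32 bits of that position, which is consistent with `convert_to_int` decoding only the first four bytes of the eight-byte field.

```python
import struct

# Numbers quoted in the question.
size = 74826134865
reported_offset = 1811690735

# Assumed tail layout: 56-byte Zip64 EOCD record, 20-byte locator, 22-byte EOCD.
expected_offset = size - (56 + 20 + 22)

# The eight-byte little-endian field as it would appear in the locator.
field = struct.pack("<Q", expected_offset)

low_32_bits = struct.unpack("<I", field[:4])[0]   # what a 4-byte decode sees
full_value = struct.unpack("<Q", field)[0]        # the whole 8-byte field

print(expected_offset)
print(low_32_bits == reported_offset)
print(full_value == expected_offset)
```

If that assumption holds, decoding all eight bytes (for example with `struct.unpack("<Q", ...)`) gives an offset measured from the start of the file, which can be passed directly to the range fetch.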