List the files inside a zip file larger than 4 GB without downloading the whole thing

Rob*_*iaz 5 python zip amazon-s3 amazon-web-services

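I am trying to list all the files inside a ".zip" file without downloading the whole archive.

I have managed to do this for files smaller than 4 GB with the following code: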

def get_list_of_files_from_zip(self, source_bucket, source_key, ignore_hidden_files=True):

    # self.s3 returns a boto3.resource('s3') that is already initialized with the keys
    s3_object = self.s3.Object(source_bucket, source_key)
    size = s3_object.content_length

    # End of central directory record (EOCD)
    eocd = self._fetch_bytes_from_file(source_bucket, source_key, size - 22, 22)

    # start offset and size of the central directory
    cd_start = convert_to_int(eocd[16:20])
    cd_size = convert_to_int(eocd[12:16])

    # fetch central directory, append EOCD, and open as zipfile!
    cd = self._fetch_bytes_from_file(source_bucket, source_key, cd_start, cd_size)
    zip = ZipFile(BytesIO(cd + eocd))

    list_of_file = []
    for entry in zip.filelist:

        if ignore_hidden_files and (entry.file_size == 0 or is_hidden(entry.filename)):
            continue

        list_of_file.append({"name": entry.filename,
                             "size": entry.file_size})  # in bytes
    return list_of_file

def _fetch_bytes_from_file(self, source_bucket, source_key, start, len):
    """
    range-fetches a S3 key
    """
    end = start + len - 1
    s3_object = self.s3.Object(source_bucket, source_key).get(Range="bytes=%d-%d" % (start, end))
    return s3_object['Body'].read()



def convert_to_int(bytes):

    val = ord(bytes[0]) + (ord(bytes[1]) << 8)
    if len(bytes) > 3:
        val += (ord(bytes[2]) << 16) + (ord(bytes[3]) << 24)
    return val

The problem is that when I try the same thing on a 70 GB file, I get this:

Traceback (most recent call last):
  File "/Users/.../env/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 3035, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-b4cd8dc7616e>", line 1, in <module>
    s3.get_list_of_files_from_zip(bucket_name,key_name)
  File "/Users/.../base.py", line 153, in get_list_of_files_from_zip
    zip = ZipFile(BytesIO(cd + eocd))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 839, in _RealGetContents
    raise BadZipfile("Bad magic number for central directory")
BadZipfile: Bad magic number for central directory
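
After some research, I found that zip files larger than 4 GB have a different structure:

Zip64 file structure

According to the specification (search for "4.3.15 Zip64 end of central directory locator"),

the "Zip64 end of central directory locator" should help me find the "Zip64 end of central directory record", which would give me the start offset and length of the central directory of a zip64 file.

So this is what I did: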

size_eocd = 22 # End of central directory record
size_Zip64EndCD = 20
Zip64EndCD = self._fetch_bytes_from_file(source_bucket, source_key, size - (size_eocd + size_Zip64EndCD), size_Zip64EndCD)

# relative offset of the zip64 end of central directory record 8 bytes
relative_offset = convert_to_int(Zip64EndCD[8:16]) 
# result in my example relative_offset = 1811690735, size = 74826134865
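
This is where I got lost. The documentation says this is the "relative offset of the zip64 end of central directory record", but it does not say what the offset is relative to (the size? the position of the central directory?).

I tried the following, but I did not find the "zip64 end of central directory signature" = 0x06064b50: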

"\x50\x4b\x06\x06" in self._fetch_bytes_from_file(source_bucket, source_key, size - relative_offset, 3000)

What am I doing wrong?
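
Two details in the attempt above are worth checking. This is a hedged reading of the reported numbers, not a confirmed diagnosis. First, the locator field is 8 bytes wide, but `convert_to_int` only combines the first 4 bytes, so any offset of 4 GiB or more loses its upper bits. Second, the ZIP specification counts this offset from the start of the archive (strictly, from the start of the disk that holds the record), so it is used directly as a position rather than subtracted from `size`. The Python 3 sketch below swaps in `int.from_bytes` for `convert_to_int` (the helper name `le_int` is made up here) and shows that the reported value is consistent with a truncated 64-bit offset, assuming the zip64 record has its minimal 56-byte size.

```python
def le_int(raw):
    """Decode an unsigned little-endian integer of any width (2, 4 or 8 bytes)."""
    return int.from_bytes(raw, "little")


size = 74826134865   # archive size reported in the question
low32 = 1811690735   # value reported for relative_offset

# Assumption: the zip64 EOCD record is the minimal 56 bytes and sits directly
# before the 20-byte locator and the 22-byte EOCD at the end of the file.
expected_offset = size - (56 + 20 + 22)

# The reported value matches the low 32 bits of that expected offset.
assert expected_offset & 0xFFFFFFFF == low32
assert expected_offset == (17 << 32) + low32

# Decoding all 8 bytes recovers the full offset; decoding 4 reproduces the
# truncated value.
raw = expected_offset.to_bytes(8, "little")
assert le_int(raw[:4]) == low32
assert le_int(raw) == expected_offset
```

If that reading is right, fetching 56 bytes starting at `expected_offset` (not at `size - relative_offset`) should return data that begins with `b"\x50\x4b\x06\x06"`.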

pmq*_*mqs 2

I wrote zipdetails a long time ago to help me understand the internal structure of zip files.

Let's create a zip64 zip file (the -fz option forces Zip64).

$ zip -fz xx.zip /tmp/Makefile

$ unzip -l xx.zip
Archive:  xx.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     1240  02-05-2020 14:31   tmp/Makefile
---------                     -------
     1240                     1 file
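
If you run zipdetails against a zip64 zip file and look at the end, where the central directory data is located, you will see something like this. I have added extra annotation to show the pointer value you need to follow. The "relative offset of the zip64 end of central directory record" field, which is stored in the "Zip64 end of central directory locator", points to the location of the "Zip64 end of central directory record". In this case that is hex 299, counted from the start of the file.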

0299 ZIP64 END CENTRAL DIR 06064B50  <----------------+ 
     RECORD                                           |
029D Size of record        000000000000002C           |
02A5 Created Zip Spec      1E '3.0'                   |
02A6 Created OS            03 'Unix'                  |
02A7 Extract Zip Spec      2D '4.5'                   |
02A8 Extract OS            00 'MS-DOS'                |
02A9 Number of this disk   00000000                   |
02AD Central Dir Disk no   00000000                   |
02B1 Entries in this disk  0000000000000001           |
02B9 Total Entries         0000000000000001           |
02C1 Size of Central Dir   000000000000005E           |
02C9 Offset to Central dir 000000000000023B           |
                                                      |
02D1 ZIP64 END CENTRAL DIR 07064B50                   |
     LOCATOR                                          |
02D5 Central Dir Disk no   00000000                   |
02D9 Offset to Central dir 0000000000000299  ---------+
02E1 Total no of Disks     00000001

02E5 END CENTRAL HEADER    06054B50
02E9 Number of this disk   0000
02EB Central Dir Disk no   0000
02ED Entries in this disk  0001
02EF Total Entries         0001
02F1 Size of Central Dir   0000005E
02F5 Offset to Central Dir FFFFFFFF
02F9 Comment Length        0000
Done
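
The pointer chain in the dump can be followed programmatically. The following is a minimal Python 3 sketch, not a definitive implementation. It assumes a single-disk archive with no trailing comment (so the end of central directory record is exactly the last 22 bytes). `fetch(start, length)` is a hypothetical callable that returns that byte range, for example a wrapper around an S3 range request, and the name `find_central_directory` is made up for illustration.

```python
import struct

EOCD_SIZE = 22          # end of central directory record, no comment
LOCATOR_SIZE = 20       # zip64 end of central directory locator
ZIP64_EOCD_FIXED = 56   # fixed part of the zip64 end of central directory record


def find_central_directory(fetch, size):
    """Return (cd_offset, cd_size) for an archive that is `size` bytes long.

    `fetch(start, length)` is a hypothetical helper that returns raw bytes.
    """
    start = max(0, size - EOCD_SIZE - LOCATOR_SIZE)
    tail = fetch(start, size - start)
    eocd, locator = tail[-EOCD_SIZE:], tail[:-EOCD_SIZE]

    sig, _, _, _, _, cd_size, cd_offset, _ = struct.unpack("<IHHHHIIH", eocd)
    if sig != 0x06054B50:
        raise ValueError("EOCD signature not found (archive comment present?)")

    # Without a locator this is a plain zip and the 32-bit fields are valid.
    if len(locator) != LOCATOR_SIZE:
        return cd_offset, cd_size
    loc_sig, _, zip64_eocd_offset, _ = struct.unpack("<IIQI", locator)
    if loc_sig != 0x07064B50:
        return cd_offset, cd_size

    # The locator offset counts from the start of the archive (0x299 in the dump).
    record = fetch(zip64_eocd_offset, ZIP64_EOCD_FIXED)
    fields = struct.unpack("<IQHHIIQQQQ", record)
    if fields[0] != 0x06064B50:
        raise ValueError("zip64 EOCD record signature not found")
    cd_size64, cd_offset64 = fields[8], fields[9]
    return cd_offset64, cd_size64
```

For the archive in the dump, this would return offset 0x23B and size 0x5E. Following the locator's offset instead of assuming the record sits 56 bytes before the locator keeps the sketch valid when the record carries extra "extensible data".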

Edit: updated the URL for zipdetails
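
Once the central directory bytes have been fetched, the entries can also be listed by parsing the central file headers directly, which avoids depending on how `zipfile` locates the zip64 records. This is a rough Python 3 sketch under stated assumptions: `list_entries` and `zip64_size` are hypothetical names, only the file name and uncompressed size are extracted, and names are decoded as UTF-8 when general-purpose flag bit 11 is set and as CP437 otherwise.

```python
import struct

# Fixed 46-byte part of a central directory file header.
CD_HEADER = struct.Struct("<IHHHHHHIIIHHHHHII")


def zip64_size(extra, default):
    """Return the 64-bit uncompressed size from the zip64 extra field (ID 0x0001)."""
    i = 0
    while i + 4 <= len(extra):
        header_id, data_len = struct.unpack_from("<HH", extra, i)
        if header_id == 0x0001 and data_len >= 8:
            # When the 32-bit size is 0xFFFFFFFF, the original size comes first.
            return struct.unpack_from("<Q", extra, i + 4)[0]
        i += 4 + data_len
    return default


def list_entries(cd):
    """Yield (name, uncompressed_size) pairs from raw central directory bytes."""
    pos = 0
    while pos + CD_HEADER.size <= len(cd):
        fields = CD_HEADER.unpack_from(cd, pos)
        if fields[0] != 0x02014B50:
            break  # no further central file headers
        flags, size = fields[3], fields[9]
        name_len, extra_len, comment_len = fields[10:13]
        pos += CD_HEADER.size
        raw_name = cd[pos:pos + name_len]
        extra = cd[pos + name_len:pos + name_len + extra_len]
        pos += name_len + extra_len + comment_len

        if size == 0xFFFFFFFF:
            size = zip64_size(extra, size)
        name = raw_name.decode("utf-8" if flags & 0x800 else "cp437")
        yield name, size
```

Usage would look like `list(list_entries(cd))`, where `cd` holds the bytes fetched from the central directory offset and size found in the zip64 end of central directory record.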