use*_*827 2 pdf zlib python-3.x
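I'm curious about the zlib data format and am trying to understand the zlib header as described in RFC 1950 (https://tools.ietf.org/html/rfc1950). However, I'm new to this kind of low-level interpretation, and some of my conclusions seem to contradict each other.
I have the following compressed data (from a PDF stream object):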
b'h\xdebbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01\x02\x0c\x00!\xa4\x03\xc4'
In Python, I have successfully decompressed and re-compressed the data:
b'x\xdacbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01!\xa4\x03\xc4'
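As I understand the discussion and answers in "Deflate and inflate for PDF, using zlib C++", the difference in the compressed result should not matter, since it simply reflects the different ways in which the data was compressed.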
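A minimal sketch of why the byte-level difference need not matter, assuming Python's standard zlib module. The same input is compressed with two different window-size settings; the streams differ from the header onwards, yet both decompress to the same data. The sample payload and the choice of wbits=14 are illustrative assumptions and are not taken from the PDF.

```python
import zlib

# Illustrative payload (an assumption; not the contents of the PDF stream).
payload = b"zlib header experiment " * 8

# Level 9 with a 16 KiB window (wbits=14), then level 9 with the default 32 KiB window.
small = zlib.compressobj(9, zlib.DEFLATED, 14)
stream_a = small.compress(payload) + small.flush()
stream_b = zlib.compress(payload, 9)

print(stream_a[:2].hex(), stream_b[:2].hex())
print(zlib.decompress(stream_a) == zlib.decompress(stream_b) == payload)
```

With the reference zlib implementation, the first two bytes should come out as 68 de and 78 da, the same two headers seen in the data above.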
Assuming that the last four bytes, !\xa4\x03\xc4, are the ADLER32 (Adler-32 checksum), my questions concern the first 2 bytes.
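The assumption about the trailer can be checked directly. This is a small sketch, assuming Python's zlib module and an arbitrary sample payload: per RFC 1950, the last four bytes are the Adler-32 of the uncompressed data, stored most significant byte first.

```python
import zlib

payload = b"any uncompressed bytes"  # illustrative sample, an assumption
stream = zlib.compress(payload)

# The trailer is the Adler-32 of the *uncompressed* data, big-endian.
trailer = int.from_bytes(stream[-4:], "big")
print(trailer == zlib.adler32(payload))
```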
  0   1     0   1   2   3                             0   1   2   3
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
|CMF|FLG| |   [DICTID]    | |...compressed data...| |    ADLER32    |
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
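The first byte is the CMF, which in my two instances is
chr h = dec 104 = hex 68 = 01101000
chr x = dec 120 = hex 78 = 01111000
This byte is divided into a 4-bit compression method and a 4-bit information field depending on the compression method.
bits 0 to 3  CM     Compression method
bits 4 to 7  CINFO  Compression info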
+----|----+      +----|----+     +----|----+
|0000|0000| i.e. |0110|1000| and |0111|1000|
+----|----+      +----|----+     +----|----+
 CM  |CINFO       CM  |CINFO      CM  |CINFO
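where
[CM] identifies the compression method used in the file. CM = 8 denotes the "deflate" compression method with a window size up to 32K. This is the method used by gzip and PNG (see [...]). CM = 15 is reserved.
and
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. CINFO is not defined in this specification for CM not equal to 8.
As far as I understand,
you should not assume that it is always 8. Instead, you should check it and, if it is not 8, raise a "not supported" error.
See https://groups.google.com/forum/#!msg/comp.compression/_y2Wwn_Vq_E/EymIVcQ52cEJ
An exhaustive list of all 64 current possibilities for zlib headers: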
COMMON
78 01
78 5e
78 9c
78 da
RARE
08 1d 18 19 28 15 38 11 48 0d 58 09 68 05
08 5b 18 57 28 53 38 4f 48 4b 58 47 68 43
08 99 18 95 28 91 38 8d 48 89 58 85 68 81
08 d7 18 d3 28 cf 38 cb 48 c7 58 c3 68 de
VERY RARE
08 3c 18 38 28 34 38 30 48 2c 58 28 68 24 78 3f
08 7a 18 76 28 72 38 6e 48 6a 58 66 68 62 78 7d
08 b8 18 b4 28 b0 38 ac 48 a8 58 a4 68 bf 78 bb
08 f6 18 f2 28 ee 38 ea 48 e6 58 e2 68 fd 78 f9
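The list above can be reproduced with a short sketch. It assumes CM = 8, every CINFO from 0 to 7, every FLEVEL, both FDICT values, and an FCHECK chosen as 31 minus the remainder modulo 31, which is my understanding of how zlib picks it (the RFC only requires that CMF*256 + FLG be a multiple of 31). The function name is hypothetical.

```python
def zlib_headers():
    """Enumerate (CMF, FLG) pairs for CM = 8 with every CINFO, FLEVEL and FDICT value."""
    headers = []
    for cinfo in range(8):            # window sizes 256 B .. 32 KiB
        cmf = (cinfo << 4) | 8        # CM = 8 (deflate) in the low nibble
        for fdict in (0, 1):
            for flevel in range(4):
                flg = (flevel << 6) | (fdict << 5)
                # Assumed FCHECK choice: 31 - (value mod 31), always in 1..31.
                flg += 31 - ((cmf << 8) | flg) % 31
                headers.append((cmf, flg))
    return headers

headers = zlib_headers()
print(len(headers))
print(" ".join(f"{cmf:02x}{flg:02x}" for cmf, flg in headers if cmf == 0x78))
```

Under these assumptions, the RARE rows appear to be the smaller window sizes with FDICT clear, and the VERY RARE rows are the headers with FDICT set.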
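As far as I can tell, byte order is not the issue here. I suspect it may have something to do with the least significant bit (RFC 1950 § 2.1, Overall conventions), but I can't quite see how that would lead to, for example, 78 instead of 87...
The second byte is the FLG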
\xde -> 11011110
\xda -> 11011010
[FLG] [...] is divided as follows:
bits 0 to 4  FCHECK  (check bits for CMF and FLG)
bit  5       FDICT   (preset dictionary)
bits 6 to 7  FLEVEL  (compression level)
+-----|-|--+      +-----|-|--+     +-----|-|--+
|00000|0|00| i.e. |11011|1|10| and |11011|0|10|
+-----|-|--+      +-----|-|--+     +-----|-|--+
   C  |D| L          C  |D| L         C  |D| L
Bits 0-4, as far as I can tell, are some form of "checksum" or integrity check?
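For reference, RFC 1950 defines FCHECK so that CMF and FLG, viewed as a 16-bit unsigned integer in MSB order (CMF*256 + FLG), form a multiple of 31. A minimal sketch of that check follows; the function name is hypothetical.

```python
def header_check_ok(cmf, flg):
    """True when CMF and FLG, read as one big-endian 16-bit number, divide evenly by 31."""
    return (cmf * 256 + flg) % 31 == 0

print(header_check_ok(0x68, 0xDE))  # header of the PDF stream
print(header_check_ok(0x78, 0xDA))  # header of the re-compressed stream
print(header_check_ok(0x87, 0xDA))  # nibbles of the first byte swapped
```

Both real headers pass, while the swapped-nibble variant does not.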
Bit 5 indicates whether a dictionary is present.
FDICT (Preset dictionary) If FDICT is set, a DICT dictionary identifier is present immediately after the FLG byte. The dictionary is a sequence of bytes which are initially fed to the compressor without producing any compressed output. DICT is the Adler-32 checksum of this sequence of bytes (see the definition of ADLER32 below). The decompressor can use this identifier to determine which dictionary has been used by the compressor.
Assuming that "1" indicates "is set"
\xde -> 11011_1_10
\xda -> 11011_0_10
According to the specification, DICTID consists of 4 bytes. The four bytes that follow in my compressed streams are
bbd\x10
cbd\x10
Why are the compressed data from the PDF stream object (with FDICT 1) and the data compressed with Python's zlib (with FDICT 0) almost identical?
Granted that I do not understand the function of the DICTID, but is it not supposed to exist only if FDICT is set?
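For comparison, a stream that really does use a preset dictionary can be produced with the zdict parameter of Python's zlib.compressobj. This is a sketch with made-up dictionary and payload bytes; it reads bit 5 of FLG, counting from the least significant bit, and compares the four bytes after FLG with the Adler-32 of the dictionary.

```python
import zlib

preset = b"hypothetical preset dictionary"   # illustrative, an assumption
payload = b"hypothetical payload that reuses the preset dictionary"

comp = zlib.compressobj(zdict=preset)
stream = comp.compress(payload) + comp.flush()

fdict = (stream[1] >> 5) & 1                  # bit 5 of FLG
dictid = int.from_bytes(stream[2:6], "big")   # the 4 bytes after FLG
print(fdict, dictid == zlib.adler32(preset))

# The decompressor needs the same dictionary to restore the payload.
decomp = zlib.decompressobj(zdict=preset)
print(decomp.decompress(stream) == payload)
```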
Bits 6-7 set the FLEVEL (compression level)
These flags are available for use by specific compression methods. The "deflate" method (CM = 8) sets these flags as follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
The information in FLEVEL is not needed for decompression; it is there to indicate if recompression might be worthwhile.
I would have thought that the flags would be:
0 (00)
1 (01)
2 (10)
3 (11)
However, from "What does a zlib header look like?":
01 (00000001) - No Compression/low
[5e (01011110) - Default Compression?]
9c (10011100) - Default Compression
da (11011010) - Best Compression
I note, however, that the two left-most bits seem to correspond to what I expected; I feel I am obviously failing to comprehend something fundamental about how to interpret bits...
The RFC says:
CMF (Compression Method and flags)
This byte is divided into a 4-bit compression method and a 4-
bit information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
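The least significant bit of a byte is bit 0. The most significant bit is bit 7. So the diagram you drew to map CM and CINFO onto bits is backwards. 0x78 and 0x68 both have a CM of 8. Their CINFO values are 7 and 6, respectively.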
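In code, this bit numbering means CM is the low nibble and CINFO is the high nibble. A minimal sketch follows; the function name is hypothetical.

```python
def split_cmf(cmf):
    """Return (CM, CINFO) for a CMF byte, with bit 0 as the least significant bit."""
    cm = cmf & 0x0F             # bits 0 to 3: the low nibble
    cinfo = (cmf >> 4) & 0x0F   # bits 4 to 7: the high nibble
    return cm, cinfo

for cmf in (0x68, 0x78):
    cm, cinfo = split_cmf(cmf)
    print(f"{cmf:#04x} -> CM={cm}, CINFO={cinfo}")
```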
CINFO is what the RFC says it is:
CINFO (Compression info)
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
size, minus eight (CINFO=7 indicates a 32K window size).
So a CINFO of 7 means a 32 KiB window, and 6 means 16 KiB. CINFO == 0 does not mean no compression. It means a window size of 256 bytes.
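The relationship can be written out as a short sketch: since CINFO is log2(window size) minus 8, the window size is 2 raised to (CINFO + 8). The function name is hypothetical.

```python
def window_size(cinfo):
    """LZ77 window size in bytes for CM = 8, where CINFO = log2(size) - 8."""
    if not 0 <= cinfo <= 7:
        raise ValueError("RFC 1950 does not allow CINFO above 7")
    return 1 << (cinfo + 8)

for cinfo in range(8):
    print(cinfo, window_size(cinfo))
```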
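For the flags byte, you have it backwards again. FDICT is not set. For both of your examples, FLEVEL is binary 11 (that is, 3), maximum compression.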
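The same least-significant-bit-first reading applied to FLG can be sketched as follows; the function name is hypothetical. FCHECK sits in the low five bits, FDICT is bit 5, and FLEVEL is the top two bits.

```python
def split_flg(flg):
    """Return (FCHECK, FDICT, FLEVEL) for a FLG byte, with bit 0 as the least significant bit."""
    fcheck = flg & 0x1F          # bits 0 to 4
    fdict = (flg >> 5) & 0x01    # bit 5
    flevel = (flg >> 6) & 0x03   # bits 6 to 7
    return fcheck, fdict, flevel

print(split_flg(0xDE))  # FLG of the PDF stream
print(split_flg(0xDA))  # FLG of the re-compressed stream
```

Both bytes give FDICT = 0 and FLEVEL = 3; they differ only in FCHECK, which compensates for the different CMF values.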