Understanding the zlib header: CMF (CM, CINFO), FLG (FDICT/DICTID, FLEVEL); RFC1950 § 2.2. Data format

use*_*827 2 pdf zlib python-3.x

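I am curious about the zlib data format and am trying to understand the zlib header as described in RFC1950 (https://tools.ietf.org/html/rfc1950). However, I am new to this kind of low-level interpretation, and some of my conclusions seem to conflict with each other.

I have the following compressed data (from a PDF stream object):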

b'h\xdebbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01\x02\x0c\x00!\xa4\x03\xc4'

In Python, I have successfully decompressed and re-compressed the data:

b'x\xdacbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01!\xa4\x03\xc4'

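As I understand the discussion and answers in "Deflate and inflate for PDF, using zlib C++", the difference between the two compressed results should not matter, since it is only an effect of different applications compressing the data in different ways.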

Assuming the last four bytes, !\xa4\x03\xc4, are the ADLER32 (Adler-32 checksum), my questions concern the first two bytes.
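The assumption about the trailer can be checked with a minimal sketch. It uses a made-up payload rather than the PDF stream above, and relies only on Python's standard zlib module; the variable names are illustrative.

```python
import zlib

payload = b"hello zlib " * 4      # arbitrary sample data, not the PDF stream
stream = zlib.compress(payload)

# The last four bytes of a zlib stream hold ADLER32, most significant byte first.
trailer = int.from_bytes(stream[-4:], "big")

# ADLER32 is computed over the uncompressed data.
computed = zlib.adler32(zlib.decompress(stream))

trailer_matches = trailer == computed
print(hex(trailer), hex(computed), trailer_matches)
```

The same comparison can be applied to any complete zlib stream by replacing `stream` with its bytes.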

  0   1     0   1   2   3                             0   1   2   3
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
|CMF|FLG| |    [DICTID]   | |...compressed data...| |    ADLER32    |
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+

CMF

The first byte is the CMF, which in my two instances is

  • chr h = dec 104 = hex 68 = 01101000
  • chr x = dec 120 = hex 78 = 01111000

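This byte is divided into a 4-bit compression method and a 4-bit information field depending on the compression method.

  • bits 0 to 3 CM Compression method

  • bits 4 to 7 CINFO Compression info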

+----|----+      +----|----+     +----|----+
|0000|0000| i.e. |0110|1000| and |0111|1000|
+----|----+      +----|----+     +----|----+
  CM |CINFO        CM |CINFO       CM |CINFO

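where

[CM] identifies the compression method used in the file. CM = 8 denotes the "deflate" compression method with a window size up to 32K. This is the method used by gzip and PNG [...]. CM = 15 is reserved.

For CM = 8, CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. CINFO is not defined in this specification for CM not equal to 8.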

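As I understand it,

  • the only valid CM is 8
  • CINFO can be 0-7

See /sf/answers/2444841381/

You should not assume that it is always 8. Instead, you should check it, and if it is not 8, throw a "not supported" error.

See https://groups.google.com/forum/#!msg/comp.compression/_y2Wwn_Vq_E/EymIVcQ52cEJ

An exhaustive list of all 64 current possibilities for a zlib header: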

COMMON
78 01
78 5e
78 9c
78 da
RARE
08 1d   18 19   28 15   38 11   48 0d   58 09   68 05
08 5b   18 57   28 53   38 4f   48 4b   58 47   68 43
08 99   18 95   28 91   38 8d   48 89   58 85   68 81
08 d7   18 d3   28 cf   38 cb   48 c7   58 c3   68 de
VERY RARE
08 3c   18 38   28 34   38 30   48 2c   58 28   68 24   78 3f
08 7a   18 76   28 72   38 6e   48 6a   58 66   68 62   78 7d
08 b8   18 b4   28 b0   38 ac   48 a8   58 a4   68 bf   78 bb
08 f6   18 f2   28 ee   38 ea   48 e6   58 e2   68 fd   78 f9
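The list can be reproduced with a short sketch that enumerates every CINFO, FDICT and FLEVEL combination for CM = 8 and fills in FCHECK so that the 16-bit value CMF*256 + FLG is a multiple of 31. One assumption is labeled in the code: when the remainder is already 0, the sketch uses FCHECK = 31, which is what the quoted list shows (for example 68 bf and 78 3f), although FCHECK = 0 would also satisfy the RFC. The function name is hypothetical.

```python
def zlib_headers():
    """Enumerate CMF/FLG pairs for CM = 8 (hypothetical helper)."""
    headers = []
    for cinfo in range(8):                # CINFO 0..7
        cmf = (cinfo << 4) | 8            # CM = 8 sits in the low nibble
        for fdict in (0, 1):
            for flevel in range(4):
                flg = (flevel << 6) | (fdict << 5)
                value = (cmf << 8) | flg
                # Assumption: always add 31 - (value % 31), so a remainder
                # of 0 gives FCHECK = 31 rather than 0.
                flg += 31 - (value % 31)
                headers.append(bytes([cmf, flg]))
    return headers

headers = zlib_headers()
print(len(headers))   # 64
# Headers with a 32K window and no preset dictionary (the "COMMON" group):
print(" ".join(h.hex() for h in headers if h[0] == 0x78 and not h[1] & 0x20))
# 7801 785e 789c 78da
```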

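Q1 My first question is simple

  • Why does CINFO come before CM? That is,
  • why is it not 87, 80, 81, 82, 83, ...

As far as I know, byte order is not an issue here. I suspect it may have to do with least-significant-bit ordering (RFC1950 § 2.1. Overall conventions), but I do not quite see how that would lead to, for example, 78 rather than 87 ...

Q2 My second question

  • If CINFO 7 denotes "a window size up to 32K", what do 1-6 correspond to? (Assuming 0 means a window size of 0, i.e. no compression applied.)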

FLG

The second byte is the FLG

\xde -> 11011110
\xda -> 11011010

[FLG] [...] is divided as follows:

  • bits 0 to 4 FCHECK (check bits for CMF and FLG)

  • bit 5 FDICT (preset dictionary)

  • bits 6 to 7 FLEVEL (compression level)

+-----|-|--+      +-----|-|--+     +-----|-|--+
|00000|0|00| i.e. |11011|1|10| and |11011|0|10|
+-----|-|--+      +-----|-|--+     +-----|-|--+
   C  |D| L          C  |D| L         C  |D| L

Bits 0-4, as far as I can tell, are some form of "checksum" or integrity control?

Bit 5 indicates whether a dictionary is present.

FDICT (Preset dictionary) If FDICT is set, a DICT dictionary identifier is present immediately after the FLG byte. The dictionary is a sequence of bytes which are initially fed to the compressor without producing any compressed output. DICT is the Adler-32 checksum of this sequence of bytes (see the definition of ADLER32 below). The decompressor can use this identifier to determine which dictionary has been used by the compressor.
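The mechanism in the quoted paragraph can be observed with Python's zlib module, which accepts a preset dictionary through the `zdict` argument of `zlib.compressobj`. This is a minimal sketch; the dictionary and data are made up for illustration and have nothing to do with the PDF stream above.

```python
import zlib

dictionary = b"stream endstream obj endobj"    # hypothetical preset dictionary
data = b"1 0 obj stream endstream endobj"      # hypothetical input

compressor = zlib.compressobj(zdict=dictionary)
stream = compressor.compress(data) + compressor.flush()

fdict = (stream[1] >> 5) & 1                   # bit 5 of FLG
dictid = int.from_bytes(stream[2:6], "big")    # four bytes right after FLG
print(fdict, dictid == zlib.adler32(dictionary))

# Without a dictionary the FDICT bit is clear, and the bytes after FLG
# are already compressed data.
plain = zlib.compress(data)
plain_fdict = (plain[1] >> 5) & 1
print(plain_fdict)

# Decompression needs the same dictionary.
restored = zlib.decompressobj(zdict=dictionary).decompress(stream)
```

Counting bit 0 as the least significant bit is what makes `(flg >> 5) & 1` the FDICT flag.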

Q3 My third question

Assuming that "1" indicates "is set"

\xde -> 11011_1_10
\xda -> 11011_0_10

According to the specification, DICTID consists of 4 bytes. The four following bytes in the compressed streams I have are

bbd\x10
cbd\x10

Why are the compressed data from the PDF stream object (with the FDICT 1) and the compressed data with python zlib (with the FDICT 0) almost identical?

Granted that I do not understand the function of the DICTID, but is it not supposed to exist only if FDICT is set?

Q4 My fourth question

Bits 6-7 set the FLEVEL (Compression level)

These flags are available for use by specific compression methods. The "deflate" method (CM = 8) sets these flags as follows:

0 - compressor used fastest algorithm

1 - compressor used fast algorithm

2 - compressor used default algorithm

3 - compressor used maximum compression, slowest algorithm

The information in FLEVEL is not needed for decompression; it is there to indicate if recompression might be worthwhile.
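One way to see which FLEVEL value ends up in the header is to compress the same input at every level Python's zlib module accepts and read bits 6 to 7 of the second byte. This is only a sketch: the sample input is arbitrary, and the mapping from compression level to FLEVEL depends on the zlib build linked into Python, so no particular output is claimed here.

```python
import zlib

sample = b"example data " * 8      # arbitrary input, not taken from the question
flevels = {}
for level in range(10):            # levels 0 through 9
    cmf, flg = zlib.compress(sample, level)[:2]
    flevels[level] = flg >> 6      # bits 6 to 7 of FLG, bit 0 being the LSB
    print(level, f"{cmf:02x} {flg:02x}", f"FLEVEL={flevels[level]:02b}")
```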

I would have thought that the flags would be:

0 (00)
1 (01)
2 (10)
3 (11)

However, from "What does a zlib header look like?":

01 (00000001) - No Compression/low
[5e (01011110) - Default Compression?]
9c (10011100) - Default Compression
da (11011010) - Best Compression

I note, however, that the two left-most bits seem to correspond to what I expected; I feel I am obviously failing to comprehend something fundamental about how to interpret bits...

Mar*_*ler 5

The RFC says:

CMF (Compression Method and flags)
         This byte is divided into a 4-bit compression method and a 4-
         bit information field depending on the compression method.

            bits 0 to 3  CM     Compression method
            bits 4 to 7  CINFO  Compression info

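The least significant bit of a byte is bit 0. The most significant bit is bit 7. So the diagram you drew to map CM and CINFO onto the bits is reversed. 0x78 and 0x68 both have a CM of 8. Their CINFO values are 7 and 6, respectively.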
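With bit 0 as the least significant bit, CM is the low nibble and CINFO the high nibble. A minimal sketch of that split (the function name is only illustrative):

```python
def split_cmf(cmf):
    """Return (CM, CINFO) for a CMF byte; bit 0 is the least significant bit."""
    cm = cmf & 0x0F        # bits 0 to 3
    cinfo = cmf >> 4       # bits 4 to 7
    return cm, cinfo

for cmf in (0x68, 0x78):
    print(hex(cmf), split_cmf(cmf))
# 0x68 (8, 6)
# 0x78 (8, 7)
```

This is why the hex digits read "78" rather than "87": the high hex digit is bits 4 to 7, which is CINFO.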

CINFO is what the RFC says it is:

CINFO (Compression info)
   For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
   size, minus eight (CINFO=7 indicates a 32K window size).

So a CINFO of 7 means a 32 KiB window. 6 means 16 KiB. CINFO == 0 does not mean no compression. It means a window size of 256 bytes.
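Since CINFO is log2(window size) minus 8, the window size for each value is 2^(CINFO + 8). A short sketch that tabulates all eight values:

```python
# Window size in bytes for each permitted CINFO value (CM = 8).
window_sizes = {cinfo: 1 << (cinfo + 8) for cinfo in range(8)}

for cinfo, size in window_sizes.items():
    print(cinfo, size)
# 0 256
# 1 512
# 2 1024
# 3 2048
# 4 4096
# 5 8192
# 6 16384
# 7 32768
```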

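For the flag byte, you have it backwards again. FDICT is not set. For both of your examples, the compression level bits are 11, maximum compression.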
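The flag byte can be decoded with the same bit numbering, and the FCHECK rule from the RFC (CMF*256 + FLG must be a multiple of 31) can be verified at the same time. A minimal sketch applied to the two headers from the question; the function name is illustrative:

```python
def split_flg(cmf, flg):
    """Return (FCHECK, FDICT, FLEVEL, header_is_valid) for a CMF/FLG pair."""
    fcheck = flg & 0x1F             # bits 0 to 4
    fdict = (flg >> 5) & 1          # bit 5
    flevel = flg >> 6               # bits 6 to 7
    valid = ((cmf << 8) | flg) % 31 == 0
    return fcheck, fdict, flevel, valid

for cmf, flg in ((0x68, 0xDE), (0x78, 0xDA)):
    print(f"{cmf:02x} {flg:02x}", split_flg(cmf, flg))
# 68 de (30, 0, 3, True)
# 78 da (26, 0, 3, True)
```

Both headers have FDICT = 0, so no DICTID follows and the bytes after FLG are compressed data.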