压缩具有特定顺序的正整数向量 (int32)

Question

压缩具有特定顺序的正整数向量 (int32)

use*_*575 11 compression algorithm integer bit-manipulation

我正在尝试压缩长向量（它们的大小范围从 1 到 1 亿个元素）。向量具有正整数，其值范围从 0 到 1 或 1 亿（取决于向量大小）。因此，我使用 32 位整数来包含大数字，但这会消耗太多存储空间。这些向量具有以下特征：

所有值都是正整数。它们的范围随着向量大小的增长而增长。
值在增加，但较小的数字确实经常干预（见下图）。
特定索引之前的值都不大于该索引（索引从零开始）。例如，索引 6 之前出现的值都不大于 6。但是，较小的值可能会在该索引之后重复。这适用于整个阵列。
我通常处理很长的数组。因此，当数组长度超过 100 万个元素时，即将出现的数字大多是与先前重复出现的数字混合的大数字。较短的数字通常比较大的数字更频繁地重新出现。当您通过数组时，新的更大的数字会添加到数组中。

以下是数组中的值示例：{initial padding..., 0, 1, 2, 3, 4, 5, 6, 4, 7, 4, 8, 9, 1, 10, ... later ..., 1110, 11, 1597, 1545, 1392, 326, 1371, 1788, 541,...}

这是向量的一部分的图：

我想要什么？：因为我使用的是 32 位整数，这会浪费大量内存，因为可以用小于 32 位表示的较小数字也会重复。我想最大限度地压缩这个向量以节省内存（理想情况下，减少 3 倍，因为只有减少这个数量或更多才能满足我们的需求！）。实现这一目标的最佳压缩算法是什么？或者是否可以利用上述数组的特征将该数组中的数字可逆地转换为 8 位整数？

我尝试过或考虑过的事情：

Delta 编码：这在这里不起作用，因为矢量并不总是增加。
霍夫曼编码：这里似乎没有帮助，因为数组中唯一数字的范围非常大，因此，编码表将是一个很大的开销。
使用变量 Int 编码。即对较小的数字使用 8 位整数，对较大的数字使用 16 位......等等。这将向量大小减小到 size*0.7（不令人满意，因为它没有利用上述特定特性）
我不太确定以下链接中描述的这种方法是否适用于我的数据：http: //ygdes.com/ddj-3r/ddj-3r_compact.html 我不太了解该方法，但它给了我鼓励尝试类似的事情，因为我认为可以利用数据中的某些顺序来发挥它的优势。例如，我尝试将任何大于 255 的数字（n）重新分配给 n-255，以便我可以将整数保留在 8 位领域中，因为我知道在该索引之前没有任何数字大于 255。但是，我无法区分重新分配的数字和重复的数字......所以这个想法不起作用，除非做一些更多的技巧来扭转重新分配......

这是感兴趣的人的数据的前 24000 个元素的链接：数据

任何意见或建议深表感谢。非常感谢。

编辑1：

这是增量编码后的数据图。如您所见，它不会减少范围！

编辑2：

我希望我能在数据中找到一种模式，允许我可逆地将 32 位向量更改为单个 8 位向量，但这似乎不太可能。我尝试将 32 位向量分解为 4 x 8 位向量，希望分解后的向量能更好地进行压缩。下面是 4 个向量的图。现在它们的范围是 0-255。我所做的是递归地将向量中的每个元素除以 255 并将提醒存储到另一个向量中。要重建原始数组，我需要做的就是： ( ( ( (vec4*255) + vec3 )*255 + vec2 ) *255 + vec1...

如您所见，对于当前显示的数据长度，最后一个向量全为零..实际上，这应该一直为零到第 2^24 个元素。如果我的向量总长度小于 1600 万个元素，这将减少 25%，但由于我处理的是更长的向量，因此影响要小得多。更重要的是，第三个向量似乎也有一些可压缩的特征，因为它的值在每 65,535 步后增加 1。现在看来，我可以按照建议从霍夫曼编码或可变位编码中受益。任何能让我最大限度地压缩这些数据的建议都深表感谢。如果有人感兴趣，我在这里附上了一个更大的数据样本：

https://drive.google.com/file/d/10wO3-1j3NkQbaKTcr0nl55bOH9P-G1Uu/view?usp=sharing

编辑3：

我真的很感谢所有给出的答案。我从他们身上学到了很多。对于那些有兴趣修改更大数据集的人，以下链接包含类似数据集的 1100 万个元素（压缩 33MB）

https://drive.google.com/file/d/1Aohfu6II6OdN-CqnDll7DeHPgEDLMPjP/view

解压缩数据后，您可以使用以下 C++ 代码段将数据读入 vector<int32_t>

    const char* path = "path_to\compression_int32.txt";
    std::vector<int32_t> newVector{};
    std::ifstream ifs(path, std::ios::in | std::ifstream::binary);
    std::istream_iterator<int32_t> iter{ ifs };
    std::istream_iterator<int32_t> end{};
    std::copy(iter, end, std::back_inserter(newVector));

Run Code Online (Sandbox Code Playgroud)

Answer 1

Mar*_*ler 6

通过使用属性 3，很容易在示例数据上获得比两倍压缩更好的效果，其中我采用属性 3 表示每个值必须小于其索引，索引从 1 开始。只需使用Ceiling(log ₂ ( i )) 位用于存储索引i处的数字（其中i从 1 开始）。对于第一个包含 24,977 个值的示例，使用 32 位整数将其压缩为向量大小的 43%。

\n

所需的位数仅取决于向量的长度n。位数为：

\n

1 - 2^{上限(log ₂ ( n ))} + n上限(log ₂ ( n ))

\n

_{正如 Falk H\xc3\xbcffner 所指出的，一种更简单的方法是对所有上限值（log 2} ( n )）使用固定位数。可变位数总是小于该值，但不会比大n的值少很多。

\n

如果开始时出现一系列零是很常见的，那么用计数来压缩它们。余数中只有少数两个或三个数字的游程，因此除了初始的零游程之外，游程长度编码不会有帮助。

\n

另外 2% 左右（对于大型集合）可以使用算术编码方法来削减，将索引k（从零开始的索引）处的每个值视为非常大整数的基k+1数字。这将需要上限(log ₂ ( n !)) 位。

\n

下面是算术编码、每个样本可变位编码和每个样本固定位编码的压缩比图，所有这些都与每个样本 32 位的表示成比例（序列长度采用对数尺度）：

\n

算术方法需要对压缩数据长度的整数进行乘法和除法，这对于大型向量来说非常慢。下面的代码将整数的大小限制为 64 位，但以压缩比为代价，以换取速度非常快。该代码的压缩率比上图中算术的压缩率高出约 0.2% 到 0.7%，远低于可变位。数据向量必须具有以下属性：每个值都是非负\n并且每个值都小于其位置（位置从 1 开始）。\n压缩效果仅取决于该属性，如果存在初始值，则压缩效果会略有减少连续的零。\n所提供的示例中似乎存在此压缩方法未利用的更多冗余。

\n

#include <vector>\n#include <cmath>\n\n// Append val, as a variable-length integer, to comp. val must be non-negative.\ntemplate <typename T>\nvoid write_varint(T val, std::vector<uint8_t>& comp) {\n    while (val > 0x7f) {\n        comp.push_back(val & 0x7f);\n        val >>= 7;\n    }\n    comp.push_back(val | 0x80);\n}\n\n// Return the variable-length integer at offset off in comp, updating off to\n// point after the integer.\ntemplate <typename T>\nT read_varint(std::vector<uint8_t> const& comp, size_t& off) {\n    T val = 0, next;\n    int shift = 0;\n    for (;;) {\n        next = comp.at(off++);\n        if (next > 0x7f)\n            break;\n        val |= next << shift;\n        shift += 7;\n    }\n    val |= (next & 0x7f) << shift;\n    return val;\n}\n\n// Given the starting index i >= 1, find the optimal number of values to code\n// into 64 bits or less, or up through index n-1, whichever comes first.\n// Optimal is defined as the least amount of entropy lost by representing the\n// group in an integral number of bits, divided by the number of bits. Return\n// the optimal number of values in num, and the number of bits needed to hold\n// an integer representing that group in len.\nstatic void group_ar64(size_t i, size_t n, size_t& num, int& len) {\n    // Analyze all of the permitted groups, starting at index i.\n    double min = 1.;\n    uint64_t k = 1;                 // integer range is 0..k-1\n    auto j = i + 1;\n    do {\n        k *= j;\n        auto e = log2(k);           // entropy of k possible integers\n        int b = ceil(e);            // number of bits to hold 0..k-1\n        auto loss = (b - e) / b;    // unused entropy per bit\n        if (loss < min) {\n            num = j - i;            // best number of values so far\n            len = b;                // bit length for that number\n            if (loss == 0.)\n                break;              // not going to get any better\n            min = loss;\n        }\n    } while (j < n && k <= (uint64_t)-1 / ++j);\n}\n\n// Compress the data arithmetically coded as an incrementing base integer, but\n// with a 64-bit limit on each integer. This puts values into groups that each\n// fit in 64 bits, with the least amount of wasted entropy. Also compress the\n// initial run of zeros into a count.\ntemplate <typename T>\nstd::vector<uint8_t> compress_ar64(std::vector<T> const& data) {\n    // Resulting compressed data vector.\n    std::vector<uint8_t> comp;\n\n    // Start with number of values to make the stream self-terminating.\n    write_varint(data.size(), comp);\n    if (data.size() == 0)\n        return comp;\n\n    // Run-length code the initial run of zeros. Write the number of contiguous\n    // zeros after the first one.\n    size_t i = 1;\n    while (i < data.size() && data[i] == 0)\n        i++;\n    write_varint(i - 1, comp);\n\n    // Compress the data into variable-base integers starting at index i, where\n    // each integer fits into 64 bits.\n    unsigned buf = 0;       // output bit buffer\n    int bits = 0;           // number of bits in buf (0..7)\n    while (i < data.size()) {\n        // Find the optimal number of values to code, starting at index i.\n        size_t num;  int len;\n        group_ar64(i, data.size(), num, len);\n\n        // Code num values.\n        uint64_t code = 0;\n        size_t k = 1;\n        do {\n            code += k * data[i++];\n            k *= i;\n        } while (--num);\n\n        // Write code using len bits.\n        if (bits) {\n            comp.push_back(buf | (code << bits));\n            code >>= 8 - bits;\n            len -= 8 - bits;\n        }\n        while (len > 7) {\n            comp.push_back(code);\n            code >>= 8;\n            len -= 8;\n        }\n        buf = code;\n        bits = len;\n    }\n    if (bits)\n        comp.push_back(buf);\n    return comp;\n}\n\n// Decompress the result of compress_ar64(), returning the original values.\n// Start decompression at offset off in comp. When done, off is updated to\n// point just after the compressed data.\ntemplate <typename T>\nstd::vector<T> expand_ar64(std::vector<uint8_t> const& comp, size_t& off) {\n    // Will contain the uncompressed data to return.\n    std::vector<T> data;\n\n    // Get the number of values.\n    auto vals = read_varint<size_t>(comp, off);\n    if (vals == 0)\n        return data;\n\n    // Get the number of zeros after the first one, and write all of them.\n    auto run = read_varint<size_t>(comp, off) + 1;\n    auto i = run;\n    do {\n        data.push_back(0);\n    } while (--run);\n\n    // Extract the values from the compressed data starting at index i.\n    unsigned buf = 0;       // input bit buffer\n    int bits = 0;           // number of bits in buf (0..7)\n    while (i < vals) {\n        // Find the optimal number of values to code, starting at index i. This\n        // simply repeats the same calculation that was done when compressing.\n        size_t num;  int len;\n        group_ar64(i, vals, num, len);\n\n        // Read len bits into code.\n        uint64_t code = buf;\n        while (bits + 8 < len) {\n            code |= (uint64_t)comp.at(off++) << bits;\n            bits += 8;\n        }\n        len -= bits;                    // bits to pull from last byte (1..8)\n        uint64_t last = comp.at(off++); // last byte\n        code |= (last & ((1 << len) - 1)) << bits;\n        buf = last >> len;              // save remaining bits in buffer\n        bits = 8 - len;\n\n        // Extract num values from code.\n        do {\n            i++;\n            data.push_back(code % i);\n            code /= i;\n        } while (--num);\n    }\n\n    // Return the uncompressed data.\n    return data;\n}\n

Run Code Online (Sandbox Code Playgroud)\n

Answer 2

Spe*_*tre 5

正如我在评论中建议的，您可以将数据表示为 8 位。有一些简单的方法可以有效地做到这一点，而不需要模块化算术。

您可以为此使用联合或指针，例如在 C++ 中，如果您有：

unsigned int data32[]={0,0,0,...};
unsigned char *data08=data32;

Run Code Online (Sandbox Code Playgroud)

或者您可以将其复制到 4 BYTE 数组，但这会更慢。

如果您出于任何原因必须使用模算术，那么您可能需要这样做：

 x     &255
(x>> 8)&255
(x>>16)&255
(x>>24)&255

Run Code Online (Sandbox Code Playgroud)

现在我已经在你的新数据上尝试了LZW，没有任何数据重新排序（单个LZW）的压缩率结果是81-82％（取决于字典大小，我建议使用10位LZW字典），这并不像预期的那么好。因此，我将数据重新排序为 4 个数组（就像您所做的那样），因此第一个数组具有最低 8 位，最后一个数组具有最高 8 位。12 位字典的结果其中：

ratio08: 144%
ratio08: 132%
ratio08:  26%
ratio08:   0%
total:    75%

Run Code Online (Sandbox Code Playgroud)

10 位字典的结果其中：

ratio08: 123%
ratio08: 117%
ratio08:  28%
ratio08:   0%
total:    67%

Run Code Online (Sandbox Code Playgroud)

表明 LZW 对于最低字节不利（随着大小的增加，对于较高字节也会更糟），因此仅将其用于较高字节，这将进一步提高压缩率。

不过，我预计霍夫曼应该会带来更好的结果，因此我计算了您的数据的熵：

H32 = 5.371071 , H/8 = 0.671384
H08 = 7.983666 , H/8 = 0.997958
H08 = 7.602564 , H/8 = 0.950321
H08 = 1.902525 , H/8 = 0.237816
H08 = 0.000000 , H/8 = 0.000000
total: 54%

Run Code Online (Sandbox Code Playgroud)

这意味着朴素的单个霍夫曼编码的压缩比为 67%，而单独的 4 个数组的压缩率为 54%，这要好得多，所以在你的情况下，我会选择霍夫曼编码。我在这里实现后结果：

[Huffman]
ratio08 =  99.992%
ratio08 =  95.400%
ratio08 =  24.706%
ratio08 =   0.000%
total08 =  55.025%
ratio32 =  67.592%

Run Code Online (Sandbox Code Playgroud)

这与预期的香农熵的估计非常匹配（不考虑解码表）......

然而，对于非常大的数据集，我预计朴素的霍夫曼将开始比单独的 4x 霍夫曼稍微好一些......

另请注意，结果被截断，因此 0% 不是零，而是小于 1% ...

[编辑1] 300 000 000 个条目估计

因此，为了模拟您的 300M 32 位数字的条件，我使用具有类似“空白空间”属性的数据的 16 位数字子部分。

log2(300 000 000) = ~28
28/32 * 16 = 14

Run Code Online (Sandbox Code Playgroud)

所以我只使用 2^14 16 位数字，它应该具有与 300M 32 位数字类似的属性 8 位霍夫曼编码导致：

ratio08 =  97.980%
ratio08 =  59.534%
total08 =  78.757%

Run Code Online (Sandbox Code Playgroud)

所以我估计编码/解码大小之间的比率为 80% ~1.25 大小减少。（希望我没有把我的假设搞砸）。

Answer 3

The*_*Way 5

解决每个压缩问题都应该从分析开始。

我查看了包含前 24976 个值的原始数据文件。最小值为 0，最大值为 24950。数据的“斜率”约为 1。但是，如果最大值如前所述仅为 33M@100M 值，则它应该随着时间的推移而减小。假设斜率=1 就有点悲观了。

至于分布，

tr '[,]' '[\n]' <compression.txt | sort -n | uniq -c | sort -nr | head -n256

Run Code Online (Sandbox Code Playgroud)

产生

Run Code Online (Sandbox Code Playgroud)

作为最常见的 256 个值。

似乎有些价值观本质上更常见。经过检查，这些共同值似乎也分布在整个数据中。

我提出以下建议：

将数据分成块。对于每个块，发送斜率的实际值，因此在对每个符号进行编码时我们知道其最大值。

使用统计编码（霍夫曼等）对块中的公共值进行编码。在本例中，字母表为 256 的截止值约为 17 次出现。

对于不太常见的值，我们保留字母表的一小部分用于发送值中的位数。

当我们遇到一个稀有值时，它的位会在没有统计建模的情况下进行编码。最上面的位可以省略，因为我们知道它始终为 1（除非值为“0”）。

通常要编码的值的范围不是2的幂。例如，如果我们有 10 个选择，则需要 4 位进行编码，但有 6 个未使用的位模式 - 有时我们只需要 3 位。前 6 个选择我们直接用 3 位进行编码。如果是 7 或 8，我们会发送一个额外的位来指示我们是指 9 还是 10。

此外，我们可以排除任何直接从可能值列表中编码的值。否则我们有两种方法来编码相同的值，这是多余的。

归档时间：	5 年前
查看次数：	594 次
最近记录：	5 年前