压缩一组大整数

doc*_*doc 14 compression integer

我有一组整数,我希望有一个最紧凑的表示.我有以下约束/功能:

  • 它被设置,或换句话说,一个唯一整数的列表,其中顺序无关紧要
  • 集合L的大小相对较小(通常为1000个元素)
  • 整数遵循0和N-1之间的均匀分布,N相对较大(比如2 ^ 32)
  • 对压缩集的元素的访问是随机的,但如果解压缩过程不那么快就可以了
  • 显然,压缩应该是无损的

我尝试了一些事情,但我对结果不满意,并且我确信存在更好的解决方案:

  • delta编码(排序,然后编码差异),或者也排序,然后编码第i个元素和i*N/L之间的差异.两者都给出了合理的结果,但不是很好,可能是因为N和L的典型大小.编码增量的霍夫曼没有帮助,因为它们通常很大.
  • 递归范围缩减(http://ygdes.com/ddj-3r/ddj-3r_compact.html).这看起来很聪明,但在指数减少的整数上效果最好,这绝对不是这里的情况.
  • 关于stackoverflow的一些讨论类似但不完全等同于我的问题(C库用于压缩顺序正整数,压缩排序整数)

我很高兴听到你可能有任何想法.提前致谢!

更新:

事实证明,delta编码似乎接近最优解.对于集合中元素的其他其他分布,这可能不同.

Mar*_*ler 10

你可以通过数数来了解你能做的最好的事情.(我希望stackoverflow允许TeX方程式像math.stackexchange.无论如何......)

ceiling(log(Combination(2^32,1000)) / (8 * log(2))) = 2934
Run Code Online (Sandbox Code Playgroud)

因此,如果正如您所说的那样,选择是均匀分布的,那么对于该特定情况,您可能希望平均的最佳压缩是2934字节.最佳比率是4000字节的未编码表示的73.35%.

Combination(2^32,1000)只是压缩算法可能输入的总数.如果它们是均匀分布的,那么最佳编码是一个巨大的整数,它通过索引标识每个可能的输入.每个巨型整数值唯一地标识其中一个输入.想象一下在巨大的表格中按索引查找输入. ceiling(log(Combination(2^32,1000)) / log(2))是指索引整数所需的位数.

更新:

我找到了一种使用现成的压缩工具来接近理论最佳的方法.我排序,应用delta编码,并从中减去一个(因为连续的不同元素之间的差值至少为1).然后诀窍是我写出所有高字节,然后是下一个最重要的字节,等等.增量的高字节减去一个往往是零,所以将很多零组合在一起,这是标准的压缩实用程序所喜欢的.此外,下一组字节往往偏向低值.

对于该示例(来自0..2 ^ 32-1的1000个统一且不同的样本),在通过时获得平均3110个字节gzip -9,并且通过xz -9(xz使用相同的压缩,LZMA,7zip)获得3098个字节.这些非常接近2934的理论最佳平均值.Gzip也有18字节的开销,而xz的开销为24字节,包括头文件和预告片.因此,与理论上最好的比较将是3092 gzip -9和3074 xz -9.比理论上最好的大约5%.

更新2:

我实现了排列的直接编码,平均达到了2974个字节,比理论上最好的只有1%多一点.我使用GNU多精度算术库来编码一个巨大整数中每个排列的索引.编码和解码的实际代码如下所示.我为这些mpz_*函数添加了注释,从名称中可能并不明显它们正在做什么算术运算.

/* Recursively code the members in set[] between low and high (low and high
   themselves have already been coded).  First code the middle member 'mid'.
   Then recursively code the members between low and mid, and then between mid
   and high. */
local void combination_encode_between(mpz_t pack, mpz_t base,
                                      const unsigned long *set,
                                      int low, int high)
{
    int mid;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately (also in that case, verify that set[] is sorted
       in ascending order) */
    mid = (low + high) >> 1;
    if (mid == low) {
        assert(set[low] < set[high]);
        return;
    }

    /* code set[mid] into pack, and update base with the number of possible
       set[mid] values between set[low] and set[high] for the next coded
       member */
        /* pack += base * (set[mid] - set[low] - 1) */
    mpz_addmul_ui(pack, base, set[mid] - set[low] - 1);
        /* base *= set[high] - set[low] - 1 */
    mpz_mul_ui(base, base, set[high] - set[low] - 1);

    /* code the rest between low and high */
    combination_encode_between(pack, base, set, low, mid);
    combination_encode_between(pack, base, set, mid, high);
}

/* Encode the set of integers set[0..num-1], where each element is a unique
   integer in the range 0..max.  No value appears more than once in set[]
   (hence the name "set").  The elements of set[] must be sorted in ascending
   order. */
local void combination_encode(mpz_t pack, const unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t base;

    /* handle degenerate cases and verify last member <= max -- code set[0]
       into pack as simply itself and set base to the number of possible set[0]
       values for coding the next member */
    if (num < 1) {
            /* pack = 0 */
        mpz_set_ui(pack, 0);
        return;
    }
        /* pack = set[0] */
    mpz_set_ui(pack, set[0]);
    if (num < 2) {
        assert(set[0] <= max);
        return;
    }
    assert(set[num - 1] <= max);
        /* base = max - num + 2 */
    mpz_init_set_ui(base, max - num + 2);

    /* code the last member of the set and update base with the number of
       possible last member values */
        /* pack += base * (set[num - 1] - set[0] - 1) */
    mpz_addmul_ui(pack, base, set[num - 1] - set[0] - 1);
        /* base *= max - set[0] */
    mpz_mul_ui(base, base, max - set[0]);

    /* encode the members between 0 and num - 1 */
    combination_encode_between(pack, base, set, 0, num - 1);
    mpz_clear(base);
}

/* Recursively decode the members in set[] between low and high (low and high
   themselves have already been decoded).  First decode the middle member
   'mid'. Then recursively decode the members between low and mid, and then
   between mid and high. */
local void combination_decode_between(mpz_t unpack, unsigned long *set,
                                      int low, int high)
{
    int mid;
    unsigned long rem;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately */
    mid = (low + high) >> 1;
    if (mid == low)
        return;

    /* extract set[mid] as the remainder of dividing unpack by the number of
       possible set[mid] values, update unpack with the quotient */
        /* div = set[high] - set[low] - 1, rem = unpack % div, unpack /= div */
    rem = mpz_fdiv_q_ui(unpack, unpack, set[high] - set[low] - 1);
    set[mid] = set[low] + 1 + rem;

    /* decode the rest between low and high */
    combination_decode_between(unpack, set, low, mid);
    combination_decode_between(unpack, set, mid, high);
}

/* Decode from pack the set of integers encoded by combination_encode(),
   putting the result in set[0..num-1].  max must be the same value used when
   encoding. */
local void combination_decode(const mpz_t pack, unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t unpack;
    unsigned long rem;

    /* handle degnerate cases, returning the value of pack as the only element
       for num == 1 */
    if (num < 1)
        return;
    if (num < 2) {
            /* set[0] = (unsigned long)pack */
        set[0] = mpz_get_ui(pack);
        return;
    }

    /* extract set[0] as the remainder after dividing pack by the number of
       possible set[0] values, set unpack to the quotient */
    mpz_init(unpack);
        /* div = max - num + 2, set[0] = pack % div, unpack = pack / div */
    set[0] = mpz_fdiv_q_ui(unpack, pack, max - num + 2);

    /* extract the last member as the remainder after dividing by the number
       of possible values, taking into account the first member -- update
       unpack with the quotient */
        /* rem = unpack % max - set[0], unpack /= max - set[0] */
    rem = mpz_fdiv_q_ui(unpack, unpack, max - set[0]);
    set[num - 1] = set[0] + 1 + rem;

    /* decode the members between 0 and num - 1 */
    combination_decode_between(unpack, set, 0, num - 1);
    mpz_clear(unpack);
}
Run Code Online (Sandbox Code Playgroud)

有一些mpz_*功能可以将数字写入文件并将其读回,或将数字导出到内存中的指定格式,然后将其导回.