Lossless compression of small blocks using a precomputed dictionary

Fan*_*ius · 16 · compression-algorithm

I have an application where I am reading and writing small blocks of data (a few hundred bytes) hundreds of millions of times. I'd like to generate a compression dictionary from sample data files and use that dictionary forever while reading and writing the small blocks. I'm leaning toward the LZW compression algorithm. The Wikipedia page (http://en.wikipedia.org/wiki/Lempel-Ziv-Welch) lists pseudocode for compression and decompression. Modifying it so that dictionary creation becomes a separate block of code looks fairly straightforward. So I have two questions:

  1. Am I on the right track, or is there a better way?
  2. Why does the LZW algorithm add to the dictionary during the decompression step? Can I omit that, or would I lose efficiency in my dictionary?

Thanks.

Update: Now I'm thinking the ideal case would be to find a library that lets me store the dictionary separately from the compressed data. Does anything like that exist?

Update: I ended up taking the code at http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c and adapting it. I am Chris in the comments on that page. I emailed my modifications back to the blog author, but I haven't heard back yet. The compression ratios I'm seeing with that code are not impressive. Maybe that is due to the 8-bit tree size.

Update: I converted it to 16-bit and the compression improved. It is also much faster than the original code.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace Book.Core
{
  public class Huffman16
  {
    private readonly double log2 = Math.Log(2);

    private List<Node> HuffmanTree = new List<Node>();

    internal class Node
    {
      public long Frequency { get; set; }
      public byte Uncoded0 { get; set; }
      public byte Uncoded1 { get; set; }
      public uint Coded { get; set; }
      public int CodeLength { get; set; }
      public Node Left { get; set; }
      public Node Right { get; set; }

      public bool IsLeaf
      {
        get { return Left == null; }
      }

      public override string ToString()
      {
        //Pad with leading zeros so codes longer than the raw binary string don't throw
        var coded = Convert.ToString(Coded, 2).PadLeft(CodeLength, '0');
        return string.Format("Uncoded={0}, Coded={1}, Frequency={2}", (Uncoded1 << 8) | Uncoded0, coded, Frequency);
      }
    }

    public Huffman16(long[] frequencies)
    {
      if (frequencies.Length != ushort.MaxValue + 1)
      {
        throw new ArgumentException("frequencies.Length must equal " + (ushort.MaxValue + 1));
      }
      BuildTree(frequencies);
      EncodeTree(HuffmanTree[HuffmanTree.Count - 1], 0, 0);
    }

    public static long[] GetFrequencies(byte[] sampleData, bool safe)
    {
      if (sampleData.Length % 2 != 0)
      {
        throw new ArgumentException("sampleData.Length must be a multiple of 2.");
      }
      var histogram = new long[ushort.MaxValue + 1];
      if (safe)
      {
        for (int i = 0; i <= ushort.MaxValue; i++)
        {
          histogram[i] = 1;
        }
      }
      for (int i = 0; i < sampleData.Length; i += 2)
      {
        histogram[(sampleData[i] << 8) | sampleData[i + 1]] += 1000;
      }
      return histogram;
    }

    public byte[] Encode(byte[] plainData)
    {
      if (plainData.Length % 2 != 0)
      {
        throw new ArgumentException("plainData.Length must be a multiple of 2.");
      }

      Int64 iBuffer = 0;
      int iBufferCount = 0;

      using (MemoryStream msEncodedOutput = new MemoryStream())
      {
        //Write Final Output Size 1st
        msEncodedOutput.Write(BitConverter.GetBytes(plainData.Length), 0, 4);

        //Begin Writing Encoded Data Stream
        iBuffer = 0;
        iBufferCount = 0;
        for (int i = 0; i < plainData.Length; i += 2)
        {
          Node FoundLeaf = HuffmanTree[(plainData[i] << 8) | plainData[i + 1]];

          //How many bits are we adding?
          iBufferCount += FoundLeaf.CodeLength;

          //Shift the buffer
          iBuffer = (iBuffer << FoundLeaf.CodeLength) | FoundLeaf.Coded;

          //Are there at least 8 bits in the buffer?
          while (iBufferCount > 7)
          {
            //Write to output
            int iBufferOutput = (int)(iBuffer >> (iBufferCount - 8));
            msEncodedOutput.WriteByte((byte)iBufferOutput);
            iBufferCount = iBufferCount - 8;
            iBufferOutput <<= iBufferCount;
            iBuffer ^= iBufferOutput;
          }
        }

        //Write remaining bits in buffer
        if (iBufferCount > 0)
        {
          iBuffer = iBuffer << (8 - iBufferCount);
          msEncodedOutput.WriteByte((byte)iBuffer);
        }
        return msEncodedOutput.ToArray();
      }
    }

    public byte[] Decode(byte[] bInput)
    {
      long iInputBuffer = 0;
      int iBytesWritten = 0;

      //Establish Output Buffer to write unencoded data to
      byte[] bDecodedOutput = new byte[BitConverter.ToInt32(bInput, 0)];

      var current = HuffmanTree[HuffmanTree.Count - 1];

      //Begin Looping through Input and Decoding
      iInputBuffer = 0;
      for (int i = 4; i < bInput.Length; i++)
      {
        iInputBuffer = bInput[i];

        for (int bit = 0; bit < 8; bit++)
        {
          if ((iInputBuffer & 128) == 0)
          {
            current = current.Left;
          }
          else
          {
            current = current.Right;
          }
          if (current.IsLeaf)
          {
            bDecodedOutput[iBytesWritten++] = current.Uncoded1;
            bDecodedOutput[iBytesWritten++] = current.Uncoded0;
            if (iBytesWritten == bDecodedOutput.Length)
            {
              return bDecodedOutput;
            }
            current = HuffmanTree[HuffmanTree.Count - 1];
          }
          iInputBuffer <<= 1;
        }
      }
      throw new InvalidDataException("Compressed input ended before the expected number of bytes were decoded.");
    }

    private static void EncodeTree(Node node, int depth, uint value)
    {
      if (node != null)
      {
        if (node.IsLeaf)
        {
          node.CodeLength = depth;
          node.Coded = value;
        }
        else
        {
          depth++;
          value <<= 1;
          EncodeTree(node.Left, depth, value);
          EncodeTree(node.Right, depth, value | 1);
        }
      }
    }

    private void BuildTree(long[] frequencies)
    {
      var tiny = 0.1 / ushort.MaxValue;
      var fraction = 0.0;

      SortedDictionary<double, Node> trees = new SortedDictionary<double, Node>();
      for (int i = 0; i <= ushort.MaxValue; i++)
      {
        var leaf = new Node()
        {
          Uncoded1 = (byte)(i >> 8),
          Uncoded0 = (byte)(i & 255),
          Frequency = frequencies[i]
        };
        HuffmanTree.Add(leaf);
        if (leaf.Frequency > 0)
        {
          trees.Add(leaf.Frequency + (fraction += tiny), leaf);
        }
      }

      while (trees.Count > 1)
      {
        var e = trees.GetEnumerator();
        e.MoveNext();
        var first = e.Current;
        e.MoveNext();
        var second = e.Current;

        //Join smallest two nodes
        var NewParent = new Node();
        NewParent.Frequency = first.Value.Frequency + second.Value.Frequency;
        NewParent.Left = first.Value;
        NewParent.Right = second.Value;

        HuffmanTree.Add(NewParent);

        //Remove the two that just got joined into one
        trees.Remove(first.Key);
        trees.Remove(second.Key);

        trees.Add(NewParent.Frequency + (fraction += tiny), NewParent);
      }
    }

  }

}

Example usage:

To create a dictionary from sample data:

var freqs = Huffman16.GetFrequencies(File.ReadAllBytes(@"D:\nodes"), true);

To initialize an encoder with a given dictionary:

var huff = new Huffman16(freqs);

And to compress:

var encoded = huff.Encode(raw);

And to decompress:

var raw = huff.Decode(encoded);

Ton*_*Lee · 3

I think the hardest part is how to build the static dictionary. You don't want to use an LZW dictionary built from your sample data. LZW wastes a lot of time learning, since it can't build the dictionary any faster than the decompressor can (a token is only used the second time the compressor sees it, so the decompressor can add it to its dictionary the first time it sees it). The flip side of this is that it adds things to the dictionary that may never get used, just in case the string shows up again. (For example, to have a token for "stackoverflow" you'll also have entries for "ac", "ko", "ve", "rf", etc...)
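To see why that update step exists, here is a minimal LZW round trip in Python (an illustrative sketch, independent of the C# code above): the decoder adds each new string to its dictionary exactly one step after the encoder did, so dropping the update from a standard LZW decoder would break it.

```python
def lzw_compress(data: bytes) -> list[int]:
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = next_code  # encoder learns wc here...
            next_code += 1
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

def lzw_decompress(codes: list[int]) -> bytes:
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            entry = w + w[:1]  # the classic KwKwK corner case
        else:
            raise ValueError("bad compressed code")
        out.append(entry)
        table[next_code] = w + entry[:1]  # ...decoder learns it one step later
        next_code += 1
        w = entry
    return b"".join(out)
```

With a frozen, precomputed dictionary on both sides (the questioner's plan), both update lines can be removed together and the scheme stays consistent; it is only the standard, self-learning variant that requires them.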

However, looking at the raw token stream from an LZ77 algorithm could work well. You will only see tokens for strings that occur at least twice. You can then build a list of the most common tokens/strings to include in your dictionary.
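A brute-force sketch of that idea in Python (a hypothetical helper, not a real LZ77 back-reference extractor): count the substrings that occur at least twice in the sample, rank them by a rough estimate of the bytes they would save, and seed the static dictionary from the top of the list.

```python
from collections import Counter

def candidate_dictionary_strings(sample: bytes, min_len=2, max_len=8, top=16):
    """Rank substrings occurring at least twice; an LZ77 pass would
    discover the same strings as back-references."""
    scores = Counter()
    for n in range(min_len, max_len + 1):
        seen = Counter(sample[i:i + n] for i in range(len(sample) - n + 1))
        for s, count in seen.items():
            if count >= 2:
                # crude benefit estimate: (extra occurrences) * length
                scores[s] = (count - 1) * n
    return [s for s, _ in scores.most_common(top)]
```

A real implementation would also discount overlapping candidates (keeping both "stack" and "stacko" double-counts the savings), but the ranking gives a reasonable starting point for a static dictionary.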

Once you have a static dictionary, using LZW sans dictionary updates seems like a simple implementation, but for the best compression I'd consider a static Huffman table instead of the traditional 12-bit fixed-size tokens (as George Phillips suggested). An LZW dictionary will burn tokens on all the substrings you may never actually encode (e.g., if you can encode "stackoverflow", there will also be entries for "st", "sta", "stac", "stack", "stacko", and so on).
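The benefit of a static Huffman table over fixed 12-bit tokens can be estimated from the token frequencies alone. A rough Python sketch (the token strings and counts here are invented purely for illustration):

```python
import heapq

def huffman_code_lengths(freqs: dict) -> dict:
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    # heap entries carry a tiebreak int so equal counts never compare symbols
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    depth = {sym: 0 for sym in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, syms1 = heapq.heappop(heap)
        c2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            depth[sym] += 1  # each merge adds one bit to its members' codes
        heapq.heappush(heap, (c1 + c2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return depth

# Hypothetical dictionary-token frequencies: a few strings dominate.
freqs = {"the": 50, "and": 20, "ing": 15, "ion": 10, "qzx": 5}
lengths = huffman_code_lengths(freqs)
total = sum(freqs.values())
avg_bits = sum(freqs[s] * lengths[s] for s in freqs) / total
# With this skewed distribution avg_bits is far below a fixed 12 bits/token.
```

The more skewed the token distribution, the larger the win over any fixed-size code; with a near-uniform distribution the static Huffman table buys little.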

At this point it really isn't LZW any more: the clever part of LZW is how the decompressor builds the same dictionary as the compressor while seeing only the compressed data stream, and that is exactly the part you wouldn't be using. But every LZW implementation has a state where the dictionary is full and no longer updated; that is how you would use it with your static dictionary.