为什么ProtoBuf.NET的GZip大于制表符分隔值文件的GZip?

Joa*_*rel 1 gzip protocol-buffers protobuf-net

我们最近比较了使用ProtoBuf.NET或TSV(制表符分隔数据)序列化的相同表格数据(包括单表,六列,描述产品目录)的相应文件大小,两个文件随后用GZip压缩(默认的.NET实现).

令我惊讶的是,压缩的ProtoBuf.NET版本占用了比文本版本更多的空间(多达3倍). 我的宠物理论是ProtoBuf不尊重byte语义,因此不匹配GZip频率压缩树; 因此压缩效率相对较低.

另一种可能性是,ProtoBuf实际上编码了更多的数据(例如,为了便于模式版本控制),因此序列化格式在信息方面不具有严格的可比性.

有谁观察到同样的问题?压缩ProtoBuf甚至值得吗?

Mar*_*ell 5

这里可能有很多因素; 首先,请注意协议缓冲线格式对字符串使用直接UTF-8编码; 如果数据由字符串控制,它最终将需要与TSV相同的空间量.

协议缓冲区还旨在帮助存储结构化数据,即单表方案中更复杂的模型.这对于大小没有太大影响,但是开始与xml/json等比较(在能力方面更相似),差异更明显.

此外,由于协议缓冲区非常密集(尽管是UTF-8),在某些情况下压缩它实际上可以使它更大 - 您可能想要检查这是否是这种情况.

在您提供的场景的快速示例中,两种格式都提供了大致相同的大小 - 没有大规模跳转:

protobuf-net, no compression: 2498720 bytes, write 34ms, read 72ms, chk 50000
protobuf-net, gzip: 1521215 bytes, write 234ms, read 146ms, chk 50000
tsv, no compression: 2492591 bytes, write 74ms, read 122ms, chk 50000
tsv, gzip: 1258500 bytes, write 238ms, read 169ms, chk 50000
Run Code Online (Sandbox Code Playgroud)

在这种情况下,tsv略小,但最终TSV确实是一种非常简单的格式(结构化数据方面的能力非常有限),因此它很快就不足为奇了.

确实; 如果您要存储的只是一个非常简单的单表,TSV并不是一个糟糕的选择 - 但是,它最终是一种非常有限的格式.我不能重现你的"更大"的例子.

除了对结构化数据(和其他功能)的更丰富支持之外,protobuf也非常重视处理性能.现在,由于TSV非常简单,这里的边缘不会很大(但在上面很明显),但是再次:与xml,json或内置的BinaryFormatter对比,用于测试具有相似特征的格式,差别很明显.


上面数字的示例(更新为使用BufferedStream):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Text;
using ProtoBuf;
static class Program
{
    static void Main()
    {
        RunTest(12345, 1, new StringWriter()); // let everyone JIT etc
        RunTest(12345, 50000, Console.Out); // actual test
        Console.WriteLine("(done)");
        Console.ReadLine();
    }
    static void RunTest(int seed, int count, TextWriter cout)
    {

        var data = InventData(seed, count);

        byte[] raw;
        Catalog catalog;
        var write = Stopwatch.StartNew();
        using(var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, data);
            raw = ms.ToArray();
        }
        write.Stop();

        var read = Stopwatch.StartNew();
        using(var ms = new MemoryStream(raw))
        {
            catalog = Serializer.Deserialize<Catalog>(ms);
        }
        read.Stop();

        cout.WriteLine("protobuf-net, no compression: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
        raw = null; catalog = null;

        write = Stopwatch.StartNew();
        using (var ms = new MemoryStream())   
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
            using (var bs = new BufferedStream(gzip, 64 * 1024))
            {
                Serializer.Serialize(bs, data);
            } // need to close gzip to flush it (flush doesn't flush)
            raw = ms.ToArray();
        }
        write.Stop();

        read = Stopwatch.StartNew();
        using(var ms = new MemoryStream(raw))
        using(var gzip = new GZipStream(ms, CompressionMode.Decompress, true))
        {
            catalog = Serializer.Deserialize<Catalog>(gzip);
        }
        read.Stop();

        cout.WriteLine("protobuf-net, gzip: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
        raw = null; catalog = null;

        write = Stopwatch.StartNew();
        using (var ms = new MemoryStream())
        {
            using (var writer = new StreamWriter(ms))
            {
                WriteTsv(data, writer);
            }
            raw = ms.ToArray();
        }
        write.Stop();

        read = Stopwatch.StartNew();
        using (var ms = new MemoryStream(raw))
        using (var reader = new StreamReader(ms))
        {
            catalog = ReadTsv(reader);
        }
        read.Stop();

        cout.WriteLine("tsv, no compression: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
        raw = null; catalog = null;

        write = Stopwatch.StartNew();
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress))
            using(var bs = new BufferedStream(gzip, 64 * 1024))
            using(var writer = new StreamWriter(bs))
            {
                WriteTsv(data, writer);
            }
            raw = ms.ToArray();
        }
        write.Stop();

        read = Stopwatch.StartNew();
        using(var ms = new MemoryStream(raw))
        using(var gzip = new GZipStream(ms, CompressionMode.Decompress, true))
        using(var reader = new StreamReader(gzip))
        {
            catalog = ReadTsv(reader);
        }
        read.Stop();

        cout.WriteLine("tsv, gzip: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
    }

    private static Catalog ReadTsv(StreamReader reader)
    {
        string line;
        List<Product> list = new List<Product>();
        while((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split('\t');
            var row = new Product();
            row.Id = int.Parse(parts[0]);
            row.Name = parts[1];
            row.QuantityAvailable = int.Parse(parts[2]);
            row.Price = decimal.Parse(parts[3]);
            row.Weight = int.Parse(parts[4]);
            row.Sku = parts[5];
            list.Add(row);
        }
        return new Catalog {Products = list};
    }
    private static void WriteTsv(Catalog catalog, StreamWriter writer)
    {
        foreach (var row in catalog.Products)
        {
            writer.Write(row.Id);
            writer.Write('\t');
            writer.Write(row.Name);
            writer.Write('\t');
            writer.Write(row.QuantityAvailable);
            writer.Write('\t');
            writer.Write(row.Price);
            writer.Write('\t');
            writer.Write(row.Weight);
            writer.Write('\t');
            writer.Write(row.Sku);
            writer.WriteLine();
        }
    }
    static Catalog InventData(int seed, int count)
    {
        string[] lipsum =
            @"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
                .Split(' ');
        char[] skuChars = "0123456789abcdef".ToCharArray();
        Random rand = new Random(seed);
        var list = new List<Product>(count);
        int id = 0;
        for (int i = 0; i < count; i++)
        {
            var row = new Product();
            row.Id = id++;
            var name = new StringBuilder(lipsum[rand.Next(lipsum.Length)]);
            int wordCount = rand.Next(0,5);
            for (int j = 0; j < wordCount; j++)
            {
                name.Append(' ').Append(lipsum[rand.Next(lipsum.Length)]);
            }
            row.Name = name.ToString();
            row.QuantityAvailable = rand.Next(1000);
            row.Price = rand.Next(10000)/100M;
            row.Weight = rand.Next(100);
            char[] sku = new char[10];
            for(int j = 0 ; j < sku.Length ; j++)
                sku[j] = skuChars[rand.Next(skuChars.Length)];
            row.Sku = new string(sku);
            list.Add(row);
        }
        return new Catalog {Products = list};
    }
}
[ProtoContract]
public class Catalog
{
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public List<Product> Products { get; set; } 
}
[ProtoContract]
public class Product
{
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public string Name { get; set; }
    [ProtoMember(3)]
    public int QuantityAvailable { get; set;}
    [ProtoMember(4)]
    public decimal Price { get; set; }
    [ProtoMember(5)]
    public int Weight { get; set; }
    [ProtoMember(6)]
    public string Sku { get; set; }
}
Run Code Online (Sandbox Code Playgroud)