Poor Dictionary performance when adding elements in C#

RBa*_*iak 8 c# dictionary list

I have a large amount of data containing ~1.5 million entries. Each entry is an instance of a class like this:

public class Element
{
    public Guid ID { get; set; }
    public string name { get; set; }
    // ... other properties (p, p1, p2, ...)
}

I have a list of Guids (~4 million), and I need to get the corresponding names from the list of Element instances.

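I am storing the Element objects in a Dictionary, but populating it takes about 90 seconds. Is there any way to improve performance when adding items to the dictionary? The data has no duplicates, but I know the dictionary checks for duplicates when a new item is added.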

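The structure doesn't need to be a dictionary if there is a better one. I tried putting the Element objects in a List, which performs much better when adding (~9 seconds). But when I need to look up items by Guid, it takes more than 10 minutes to find all 4 million elements. I tried both List.Find() and iterating over the list manually.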

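Also, if instead of using System.Guid I convert them all to String and store their string representations in the data structure, the whole operation of filling the dictionary and filling the names into the other list takes only 10 seconds, but then my application consumes 1.2 GB of RAM instead of the 600 MB it uses when I store them as System.Guid.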

Any ideas on how to make this perform better?

xan*_*tos 8

Your problem is perhaps connected to "sequential" Guids, like:

c482fbe1-9f16-4ae9-a05c-383478ec9d11
c482fbe1-9f16-4ae9-a05c-383478ec9d12
c482fbe1-9f16-4ae9-a05c-383478ec9d13
c482fbe1-9f16-4ae9-a05c-383478ec9d14
c482fbe1-9f16-4ae9-a05c-383478ec9d15

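Dictionary<,> has a problem with those, because they often have the same GetHashCode(), so it has to do some tricks that change the search time from O(1) to O(n). You can solve it by using a custom equality comparer that calculates the hash in a different way, like: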

public class ReverseGuidEqualityComparer : IEqualityComparer<Guid>
{
    public static readonly ReverseGuidEqualityComparer Default = new ReverseGuidEqualityComparer();

    #region IEqualityComparer<Guid> Members

    public bool Equals(Guid x, Guid y)
    {
        return x.Equals(y);
    }

    public int GetHashCode(Guid obj)
    {
        var bytes = obj.ToByteArray();

        uint hash1 = (uint)bytes[0] | ((uint)bytes[1] << 8) | ((uint)bytes[2] << 16) | ((uint)bytes[3] << 24);
        uint hash2 = (uint)bytes[4] | ((uint)bytes[5] << 8) | ((uint)bytes[6] << 16) | ((uint)bytes[7] << 24);
        uint hash3 = (uint)bytes[8] | ((uint)bytes[9] << 8) | ((uint)bytes[10] << 16) | ((uint)bytes[11] << 24);
        uint hash4 = (uint)bytes[12] | ((uint)bytes[13] << 8) | ((uint)bytes[14] << 16) | ((uint)bytes[15] << 24);

        int hash = 37;

        unchecked
        {
            hash = hash * 23 + (int)hash1;
            hash = hash * 23 + (int)hash2;
            hash = hash * 23 + (int)hash3;
            hash = hash * 23 + (int)hash4;
        }

        return hash;
    }

    #endregion
}

Then you simply declare the dictionary like this:

var dict = new Dictionary<Guid, Element>(ReverseGuidEqualityComparer.Default);
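For the use case in the question (filling the dictionary once, then resolving a list of Guids to names), a minimal sketch could look like the following. The names ElementLookup, BuildIndex and LookupNames are hypothetical, the Element class is reduced to the two properties that matter here, and passing the element count as the initial capacity is my own addition (it only avoids internal resizes, it does not fix the hash collisions). Any IEqualityComparer<Guid>, such as the comparer above, can be passed in.

```csharp
using System;
using System.Collections.Generic;

// Reduced version of the class from the question (assumption: only ID and name matter here).
public class Element
{
    public Guid ID { get; set; }
    public string name { get; set; }
}

// Hypothetical helper, not part of the original answer.
public static class ElementLookup
{
    // Builds the Guid -> Element index. Passing the expected count up front
    // avoids repeated internal resizing while the dictionary grows.
    public static Dictionary<Guid, Element> BuildIndex(
        IReadOnlyCollection<Element> elements, IEqualityComparer<Guid> comparer)
    {
        var index = new Dictionary<Guid, Element>(elements.Count, comparer);
        foreach (var element in elements)
        {
            index.Add(element.ID, element);
        }
        return index;
    }

    // Resolves each requested Guid to a name; ids that are not in the index yield null.
    public static List<string> LookupNames(
        Dictionary<Guid, Element> index, IEnumerable<Guid> ids)
    {
        var names = new List<string>();
        foreach (var id in ids)
        {
            names.Add(index.TryGetValue(id, out var element) ? element.name : null);
        }
        return names;
    }
}
```

Usage would be something like ElementLookup.BuildIndex(elements, ReverseGuidEqualityComparer.Default), followed by ElementLookup.LookupNames(index, guids).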

A little test to see the difference:

private static void Increment(byte[] x)
{
    for (int i = x.Length - 1; i >= 0; i--)
    {
        if (x[i] != 0xFF)
        {
            x[i]++;
            return;
        }

        x[i] = 0;
    }
}

// You can try timing this program with the default GetHashCode:
//var dict = new Dictionary<Guid, object>();
var dict = new Dictionary<Guid, object>(ReverseGuidEqualityComparer.Default);
var hs1 = new HashSet<int>();
var hs2 = new HashSet<int>();

{
    var guid = Guid.NewGuid();

    Stopwatch sw = Stopwatch.StartNew();

    for (int i = 0; i < 1500000; i++)
    {
        hs1.Add(ReverseGuidEqualityComparer.Default.GetHashCode(guid));
        hs2.Add(guid.GetHashCode());
        dict.Add(guid, new object());
        var bytes = guid.ToByteArray();
        Increment(bytes);
        guid = new Guid(bytes);
    }

    sw.Stop();

    Console.WriteLine("Milliseconds: {0}", sw.ElapsedMilliseconds);
}

Console.WriteLine("ReverseGuidEqualityComparer distinct hashes: {0}", hs1.Count);
Console.WriteLine("Guid.GetHashCode() distinct hashes: {0}", hs2.Count);

With sequential Guids, the difference in the number of distinct hash codes is staggering:

ReverseGuidEqualityComparer distinct hashes: 1500000
Guid.GetHashCode() distinct hashes: 256
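To see why so few distinct hash codes hurt independently of Guid, here is a small sketch with int keys. LowByteComparer is a hypothetical, deliberately bad comparer that keeps only the low 8 bits of the key, so it can produce at most 256 distinct hash codes, much like the sequential Guid case above. Every key that shares a hash code ends up in the same collision chain, and each Add has to walk that chain to check for duplicates.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical comparer for illustration: at most 256 distinct hash codes.
public sealed class LowByteComparer : IEqualityComparer<int>
{
    public bool Equals(int x, int y) => x == y;

    public int GetHashCode(int obj) => obj & 0xFF;
}

public static class CollisionDemo
{
    // Adds the keys 0..count-1 to a dictionary and returns the elapsed milliseconds.
    public static long TimeFill(int count, IEqualityComparer<int> comparer)
    {
        var dict = new Dictionary<int, object>(comparer);
        var sw = Stopwatch.StartNew();

        for (int i = 0; i < count; i++)
        {
            dict.Add(i, null);
        }

        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Counts how many distinct hash codes the comparer produces for the keys 0..count-1.
    public static int CountDistinctHashes(int count, IEqualityComparer<int> comparer)
    {
        var hashes = new HashSet<int>();

        for (int i = 0; i < count; i++)
        {
            hashes.Add(comparer.GetHashCode(i));
        }

        return hashes.Count;
    }
}
```

Comparing CollisionDemo.TimeFill(50000, new LowByteComparer()) with CollisionDemo.TimeFill(50000, EqualityComparer<int>.Default) should show the slowdown. The absolute numbers depend on the machine, so I am not quoting any here.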

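Now... if you don't want to use ToByteArray() (because it allocates useless memory), there is a solution that uses reflection and expression trees. It should work correctly with Mono, because Mono "aligned" its implementation of Guid to Microsoft's in 2004, which is ancient times :-)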

public class ReverseGuidEqualityComparer : IEqualityComparer<Guid>
{
    public static readonly ReverseGuidEqualityComparer Default = new ReverseGuidEqualityComparer();

    public static readonly Func<Guid, int> GetHashCodeFunc;

    static ReverseGuidEqualityComparer()
    {
        var par = Expression.Parameter(typeof(Guid));
        var hash = Expression.Variable(typeof(int));

        var const23 = Expression.Constant(23);

        var const8 = Expression.Constant(8);
        var const16 = Expression.Constant(16);
        var const24 = Expression.Constant(24);

        var b = Expression.Convert(Expression.Convert(Expression.Field(par, "_b"), typeof(ushort)), typeof(uint));
        var c = Expression.Convert(Expression.Convert(Expression.Field(par, "_c"), typeof(ushort)), typeof(uint));
        var d = Expression.Convert(Expression.Field(par, "_d"), typeof(uint));
        var e = Expression.Convert(Expression.Field(par, "_e"), typeof(uint));
        var f = Expression.Convert(Expression.Field(par, "_f"), typeof(uint));
        var g = Expression.Convert(Expression.Field(par, "_g"), typeof(uint));
        var h = Expression.Convert(Expression.Field(par, "_h"), typeof(uint));
        var i = Expression.Convert(Expression.Field(par, "_i"), typeof(uint));
        var j = Expression.Convert(Expression.Field(par, "_j"), typeof(uint));
        var k = Expression.Convert(Expression.Field(par, "_k"), typeof(uint));

        var sc = Expression.LeftShift(c, const16);
        var se = Expression.LeftShift(e, const8);
        var sf = Expression.LeftShift(f, const16);
        var sg = Expression.LeftShift(g, const24);
        var si = Expression.LeftShift(i, const8);
        var sj = Expression.LeftShift(j, const16);
        var sk = Expression.LeftShift(k, const24);

        var body = Expression.Block(new[]
        {
            hash
        },
        new Expression[]
        {
            Expression.Assign(hash, Expression.Constant(37)),
            Expression.MultiplyAssign(hash, const23),
            Expression.AddAssign(hash, Expression.Field(par, "_a")),
            Expression.MultiplyAssign(hash, const23),
            Expression.AddAssign(hash, Expression.Convert(Expression.Or(b, sc), typeof(int))),
            Expression.MultiplyAssign(hash, const23),
            Expression.AddAssign(hash, Expression.Convert(Expression.Or(d, Expression.Or(se, Expression.Or(sf, sg))), typeof(int))),
            Expression.MultiplyAssign(hash, const23),
            Expression.AddAssign(hash, Expression.Convert(Expression.Or(h, Expression.Or(si, Expression.Or(sj, sk))), typeof(int))),
            hash
        });

        GetHashCodeFunc = Expression.Lambda<Func<Guid, int>>(body, par).Compile();
    }

    #region IEqualityComparer<Guid> Members

    public bool Equals(Guid x, Guid y)
    {
        return x.Equals(y);
    }

    public int GetHashCode(Guid obj)
    {
        return GetHashCodeFunc(obj);
    }

    #endregion

    // For comparison purpose, not used
    public int GetHashCodeSimple(Guid obj)
    {
        var bytes = obj.ToByteArray();

        unchecked
        {
            int hash = 37;

            hash = hash * 23 + (int)((uint)bytes[0] | ((uint)bytes[1] << 8) | ((uint)bytes[2] << 16) | ((uint)bytes[3] << 24));
            hash = hash * 23 + (int)((uint)bytes[4] | ((uint)bytes[5] << 8) | ((uint)bytes[6] << 16) | ((uint)bytes[7] << 24));
            hash = hash * 23 + (int)((uint)bytes[8] | ((uint)bytes[9] << 8) | ((uint)bytes[10] << 16) | ((uint)bytes[11] << 24));
            hash = hash * 23 + (int)((uint)bytes[12] | ((uint)bytes[13] << 8) | ((uint)bytes[14] << 16) | ((uint)bytes[15] << 24));

            return hash;
        }
    }
}

Another solution, based on "undocumented but working" programming (tested on .NET and Mono):

public class ReverseGuidEqualityComparer : IEqualityComparer<Guid>
{
    public static readonly ReverseGuidEqualityComparer Default = new ReverseGuidEqualityComparer();

    #region IEqualityComparer<Guid> Members

    public bool Equals(Guid x, Guid y)
    {
        return x.Equals(y);
    }

    public int GetHashCode(Guid obj)
    {
        GuidToInt32 gtoi = new GuidToInt32 { Guid = obj };

        unchecked
        {
            int hash = 37;

            hash = hash * 23 + gtoi.Int32A;
            hash = hash * 23 + gtoi.Int32B;
            hash = hash * 23 + gtoi.Int32C;
            hash = hash * 23 + gtoi.Int32D;

            return hash;
        }
    }

    #endregion

    [StructLayout(LayoutKind.Explicit)]
    private struct GuidToInt32
    {
        [FieldOffset(0)]
        public Guid Guid;

        [FieldOffset(0)]
        public int Int32A;
        [FieldOffset(4)]
        public int Int32B;
        [FieldOffset(8)]
        public int Int32C;
        [FieldOffset(12)]
        public int Int32D;
    }
}

It uses the StructLayout "trick" to superimpose a Guid on a bunch of ints: it writes the Guid and reads the ints.
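If you would rather avoid both the allocation and the undocumented layout, a possible variant is sketched below. It assumes a runtime where Guid.TryWriteBytes(Span<byte>) and System.Buffers.Binary.BinaryPrimitives are available (.NET Core 2.1 or later), which the original answer does not rely on. SpanGuidEqualityComparer is a hypothetical name. The bytes are written to a stack buffer in the same order as ToByteArray(), so the resulting hash should match the first comparer.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical allocation-free variant; assumes Guid.TryWriteBytes exists on the target runtime.
public sealed class SpanGuidEqualityComparer : IEqualityComparer<Guid>
{
    public static readonly SpanGuidEqualityComparer Default = new SpanGuidEqualityComparer();

    public bool Equals(Guid x, Guid y)
    {
        return x.Equals(y);
    }

    public int GetHashCode(Guid obj)
    {
        // 16 bytes on the stack instead of a heap-allocated array.
        Span<byte> bytes = stackalloc byte[16];
        obj.TryWriteBytes(bytes); // same byte order as ToByteArray()

        unchecked
        {
            int hash = 37;

            // Combine all four 32-bit words, so every byte of the Guid affects the hash.
            hash = hash * 23 + BinaryPrimitives.ReadInt32LittleEndian(bytes);
            hash = hash * 23 + BinaryPrimitives.ReadInt32LittleEndian(bytes.Slice(4));
            hash = hash * 23 + BinaryPrimitives.ReadInt32LittleEndian(bytes.Slice(8));
            hash = hash * 23 + BinaryPrimitives.ReadInt32LittleEndian(bytes.Slice(12));

            return hash;
        }
    }
}
```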

Why does Guid.GetHashCode() have problems with sequential IDs?

Very easy to explain: from the reference source, GetHashCode() is:

return _a ^ (((int)_b << 16) | (int)(ushort)_c) ^ (((int)_f << 24) | _k);

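So the _d, _e, _g, _h, _i, _j bytes are not part of the hash code. When a Guid is incremented, the increment first happens in the _k field (256 values), then overflows into the _j field (256 * 256 values, so 65536 values), then into the _i field (16777216 values). Clearly, because the _h, _i, _j fields are not hashed, the hash of a sequential Guid will show only 256 distinct values for any non-huge range of Guids (or at most 512 distinct values, if the _f field is incremented once, as happens if you start with a Guid similar to 12345678-1234-1234-1234-aaffffffff00, where aa (that is "our" _f) is incremented to ab after 256 increments of the Guid).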
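The counting argument can be checked without depending on what the installed runtime's Guid.GetHashCode() actually does. The sketch below re-implements the formula quoted above from the bytes of ToByteArray() (bytes 0-3 are _a, 4-5 are _b, 6-7 are _c, byte 10 is _f and byte 15 is _k) and counts distinct results over a run of sequential Guids. LegacyGuidHashDemo, LegacyHash and CountDistinct are hypothetical names of mine.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class; LegacyHash mirrors the quoted formula, not the current runtime.
public static class LegacyGuidHashDemo
{
    public static int LegacyHash(Guid guid)
    {
        byte[] x = guid.ToByteArray();

        unchecked
        {
            // Rebuild the fields that the quoted formula uses.
            int a = x[0] | (x[1] << 8) | (x[2] << 16) | (x[3] << 24);
            short b = (short)(x[4] | (x[5] << 8));
            short c = (short)(x[6] | (x[7] << 8));
            byte f = x[10];
            byte k = x[15];

            return a ^ (((int)b << 16) | (int)(ushort)c) ^ (((int)f << 24) | k);
        }
    }

    // Hashes `count` sequential Guids starting at `start` and counts the distinct hashes.
    public static int CountDistinct(Guid start, int count)
    {
        var hashes = new HashSet<int>();
        byte[] bytes = start.ToByteArray();

        for (int n = 0; n < count; n++)
        {
            hashes.Add(LegacyHash(new Guid(bytes)));

            // Same increment as in the test above: last byte first, with carry.
            for (int i = bytes.Length - 1; i >= 0; i--)
            {
                if (bytes[i] != 0xFF)
                {
                    bytes[i]++;
                    break;
                }

                bytes[i] = 0;
            }
        }

        return hashes.Count;
    }
}
```

Starting from c482fbe1-9f16-4ae9-a05c-383478ec9d11, 100000 sequential Guids give 256 distinct values. Starting from 12345678-1234-1234-1234-aaffffffff00, 1000 sequential Guids give 512, because _f changes from aa to ab after the first 256 increments.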

  • @Rawling It was built for hashing random Guids, not sequential ones. (2 upvotes)