HashSet中的Union vs Unionwith

Question

HashSet中的Union vs Unionwith

当我组合2个哈希集时,HashSet.Union vs 之间的区别是什么 HashSet.Unionwith.

我想这样组合:

HashSet<EngineType> enginesSupportAll = _filePolicyEvaluation.EnginesSupportAll;
        enginesSupportAll = enginesSupportAll != null ? new HashSet<EngineType>(engines.Union(enginesSupportAll)) : enginesSupportAll;

Run Code Online (Sandbox Code Playgroud)

这个例子的最佳方法是什么？为什么？

Answer 1

Tim*_*ter 14

好吧,它不是HashSet.Union,但是Enumerable.Union,让你使用的是与任何类型的工作的LINQ扩展方法IEnumerable<>,而HashSet.UnionWith是真正的HashSet是modifes当前实例方法.

Union 返回一个 IEnumerable<TSource>
UnionWith是void,它修改当前HashSet实例
也许UnionWith稍微更高效,因为它可以被优化

如果您不希望在方法中支持任何类型的序列,那么HashSet修复它并且您可以修改它,使用它,否则使用LINQ扩展.如果你HashSet为此目的创建实例它并不重要,我希望LINQ更灵活,并能够链接我的查询.

Answer 2

Kas*_*erg 14

鉴于一个和乙有四种方法来一个 ∪ 乙:HashSet<T> HashSet<T>

new HashSet<t>(A.Union(B))
_{(见HashSet<T&>(IEnumerable<T>)和
Enumerable.Union<T>(IEnumerable<T>, IEnumerable<T>))}
A.UnionWith(B)
HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
new HashSet<t>(A.Concat(B))
_{(见Enumerable.Concat<T>(IEnumerable<T>, IEnumerable<T>))}

每个都有其优点和缺点:

1和4是导致a HashSet而2和3是语句或语句块的表达式.
表达式1和4可以在2和3之外的更多位置使用.例如,在linq查询语法表达式中使用2或3是很麻烦的:
from x in setofSetsA as IEnumerable<HashSet<T>> from y in setOfSetsB as IEnumerable<HashSet<T>> select x.UnionWith(y)因为UnionWith返回void 将无法工作.
1,3和4保存甲和乙因为它们并返回一个新的组,而2修改甲.
在某些情况下,修改其中一个原始集合是不好的,并且存在至少一个原始集合可以被修改而没有负面后果的情况.
计算成本:

A.UnionWith(B)
(≈O((log(|A∪B|) - log(| A |))*|A∪B|)+ O(| B |))

≤

HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
(≈O((log(|A∪B|) - log(| A |))*|A∪B|)+ O(| A | + | B |))

≤

HashSet<T>(A.Concat(B))
(≈O(log(|A∪B|)*|A∪B|)+ O(| A | + | B |))

≤

HashSet<T>(A.Union(B))
(≈2*O(log(|A∪B|)*|A∪B|)+ O(| A | + | B | + |A∪B|))

下一节将深入研究参考源,以查看这些性能估计的基础.

性能

`HashSet<T>`

在联合选项1,3和4中,构造函数HashSet<T>(IEnumerable<T>, IEqualityComparer<T>)用于创建一个HashSet<T>from IEnumerable<T>.如果传递IEnumerable<T>的Count属性为 -ie,如果它是ICollection<T>- ,则此属性用于设置新的大小HashSet:

int suggestedCapacity = 0;
ICollection<T> coll = collection as ICollection<T>;
if (coll != null) {
    suggestedCapacity = coll.Count;
}
Initialize(suggestedCapacity);

Run Code Online (Sandbox Code Playgroud)

- HashSet.cs第136-141行

[Count()][10]永远不会调用该方法.因此,如果IEnumerable可以毫不费力地检索计数,则用于预留容量; 否则,HashSet在添加新元素时增长并重新分配.
因此,在选项1 A.Union(B)和选项4 A.Concat(B)中ICollection<T>,创建的HashSet将不会增长并重新分配一些(≈log(|A∪B|))次.选项3可以使用Count的甲.

构造函数调用UnionWith以填充新的空HashSet:

this.UnionWith(collection);
Run Code Online (Sandbox Code Playgroud)
- HashSet.cs第143行

UnionWith(IEnumerable<T>)迭代IEnumerable<T>传递的参数中的元素并调用AddIfNotPresent(T)每个元素.

AddIfNotPresent(T)插入元素并确保重复项永远不会插入到集合中.
HashSet<T>实现为一个插槽阵列m_slots,以及一个桶阵列m_buckets.存储桶只包含数组的int索引m_slots.每个桶的SlotS IN m_slots形式链表与索引到下一个Slot中m_slots.

AddIfNotPresent(T) 跳转到正确的存储桶,然后遍历其链接列表以检查该元素是否已存在:

for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
    if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, value)) {
        return false;
    }
}

Run Code Online (Sandbox Code Playgroud)

- HashSet.cs968-975行

接下来找到一个空闲索引并保留一个插槽.首先m_freelist,检查空闲插槽列表.当空闲列表中没有插槽时,将m_slots使用阵列中的下一个空插槽.IncreaseCapacity()如果空闲列表中没有插槽且没有空插槽,则保留更多容量(via ):

int index;
if (m_freeList >= 0) {
    index = m_freeList;
    m_freeList = m_slots[index].next;
}
else {
    if (m_lastIndex == m_slots.Length) {
        IncreaseCapacity();
        // this will change during resize
        bucket = hashCode % m_buckets.Length;
    }
    index = m_lastIndex;
    m_lastIndex++;
}

Run Code Online (Sandbox Code Playgroud)

- HashSet.cs第977-990行

AddIfNotPresent(T)有三个操作需要一些计算:调用object.GetHashCode(),object.Equals(object)在发生碰撞时调用,以及IncreaseCapacity().实际添加元素只会产生设置一些指针和一些整数的成本.

当容量HashSet<T>需求IncreaseCapacity()至少翻倍时.因此,我们可以得出结论,平均a HashSet<T>填充了75%.如果散列均匀分布,则哈希冲突的预期也为75%.

SetCapacity(int, bool),被称为IncreaseCapacity(),是最昂贵的:它分配新的数组,它将旧的插槽阵列复制到新的数组,并重新计算存储桶列表:

Slot[] newSlots = new Slot[newSize];
if (m_slots != null) {
    Array.Copy(m_slots, 0, newSlots, 0, m_lastIndex);
}

...

int[] newBuckets = new int[newSize];
for (int i = 0; i < m_lastIndex; i++) {
    int bucket = newSlots[i].hashCode % newSize;
    newSlots[i].next = newBuckets[bucket] - 1;
    newBuckets[bucket] = i + 1;
}
m_slots = newSlots;
m_buckets = newBuckets;

Run Code Online (Sandbox Code Playgroud)

- HashSet.cs第929-949行

选项1和4(new HashSet<T>(A.Union(B)))将导致稍多的调用IncreaseCapacity().没有成本A.Union(B)或A.Concat(B)- 的成本约为O(log(|A∪B|)*|A∪B|).
当使用选项2(A.UnionWith(B))或选项3(HashSet<T> C = new HashSet<T>(A); C.UnionWith(B))时,我们在成本上得到log(| A |)的"折扣":O((log(|A∪B|) - log(| A |))*| A∪B|).它(稍微)支付使用最大的集合作为目标进入另一个被合并的女巫.

`Enumerable<T>.Union(IEnumerable<T>)`

Enumerable<T>.Union(IEnumerable<T>)通过实现UnionIterator<T>(IEnumerable<T>, IEnumerable<T>, IEqualityComparer<T>).
在UnionIterator使用Set<T>-an内部类Enumerable.cs-这是非常相似的HashSet<T>. UnionIterator懒惰地Add(T)š从物品甲和乙这Set<T>和yields元素如果它们可以被添加.完成的工作Find(T, bool)类似于HashSet<T>.AddIfNotPresent(T).检查元素是否已存在:

int hashCode = InternalGetHashCode(value);
for (int i = buckets[hashCode % buckets.Length] - 1; i >= 0; i = slots[i].next) {
    if (slots[i].hashCode == hashCode && comparer.Equals(slots[i].value, value)) return true;
}

Run Code Online (Sandbox Code Playgroud)

- Enumerable.cs第2423-2426行

找一个免费索引并保留一个插槽:

int index;
if (freeList >= 0) {
    index = freeList;
    freeList = slots[index].next;
}
else {
    if (count == slots.Length) Resize();
    index = count;
    count++;
}
int bucket = hashCode % buckets.Length;
slots[index].hashCode = hashCode;
slots[index].value = value;
slots[index].next = buckets[bucket] - 1;
buckets[bucket] = index + 1;

Run Code Online (Sandbox Code Playgroud)

- Enumerable.cs第2428-2442行

Resize()类似于IncreaseCapacity().两者之间的最大区别在于Resize()不使用素数作为桶的数量,因此如果不好GetHashCode()则碰撞的可能性稍高.代码Resize():

int newSize = checked(count * 2 + 1);
int[] newBuckets = new int[newSize];
Slot[] newSlots = new Slot[newSize];
Array.Copy(slots, 0, newSlots, 0, count);
for (int i = 0; i < count; i++) {
    int bucket = newSlots[i].hashCode % newSize;
    newSlots[i].next = newBuckets[bucket] - 1;
    newBuckets[bucket] = i + 1;
}
buckets = newBuckets;
slots = newSlots;

Run Code Online (Sandbox Code Playgroud)

- Enumerable.cs第2448-2458行

性能成本与之A.Union(B)没有显着差异HashSet<T> C = new HashSet<T>(); C.UnionWith(A); C.UnionWith(B);.在选项1(new HashSet<T>(A.Union(B)))中,HashSet创建两次相同的结果导致非常昂贵的2*O(log(|A∪B|)*(|A∪B|)).选项从知道如何4个结果HashSet<T>(IEnumerable<T>)和Enumerable.Union(IEnumerable<T>, IEnumerable<T>)实现.它避免了冗余A.Union(B)导致O(log(|A∪B|)*|A∪B|)的成本.

归档时间：	8 年，6 月前
查看次数：	9472 次
最近记录：	8 年，3 月前