Julia: How to return the number of unique elements in an array

Question

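Which function in Julia returns the number of unique elements in an array?

In R, you have length(unique(x)). I can do the same thing in Julia, but I think there should be a more efficient way.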

Answer 1

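If you want an exact answer, length(unique(x)) is already efficient for general objects. If your values have a limited domain, for example UInt8, a fixed-size table may be more efficient. If you can accept an approximation, you can use the HyperLogLog data structure/algorithm, which is implemented in the OnlineStats package: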

https://joshday.github.io/OnlineStats.jl/latest/api/#OnlineStats.HyperLogLog
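
The fixed-size table idea mentioned above can be sketched as follows for UInt8 input. This is only a minimal illustration, not code from the original answer; the function name count_unique_u8 is hypothetical, and whether it beats length(unique(x)) on your data is something to benchmark rather than assume.

```julia
# Hypothetical helper: count distinct values in a UInt8 array
# using one flag per possible value instead of a hash set.
function count_unique_u8(x::AbstractArray{UInt8})
    seen = falses(256)        # BitVector with one slot per UInt8 value
    n = 0
    for v in x
        idx = Int(v) + 1      # shift 0..255 to Julia's 1-based indices
        if !seen[idx]
            seen[idx] = true
            n += 1
        end
    end
    return n
end

x = rand(UInt8, 10_000)
count_unique_u8(x) == length(unique(x))   # both count the same distinct values
```

The table has a fixed 256 entries regardless of the input length, so no hashing or resizing is needed.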

Answer 2

Cam*_*nek 6

It seems that length(Set(x)) is somewhat faster than length(unique(x)).

julia> using StatsBase, BenchmarkTools

julia> num_unique(x) = length(Set(x));

julia> x = sample(1:100, 200);

julia> num_unique(x) == length(unique(x))
true

julia> @benchmark length(unique(x)) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial: 
  memory estimate:  450.50 KiB
  allocs estimate:  36
  --------------
  minimum time:     498.130 μs (0.00% GC)
  median time:      570.588 μs (0.00% GC)
  mean time:        579.011 μs (2.41% GC)
  maximum time:     2.321 ms (63.03% GC)
  --------------
  samples:          5264
  evals/sample:     1

julia> @benchmark num_unique(x) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial: 
  memory estimate:  288.68 KiB
  allocs estimate:  8
  --------------
  minimum time:     283.031 μs (0.00% GC)
  median time:      393.317 μs (0.00% GC)
  mean time:        397.878 μs (4.24% GC)
  maximum time:     33.499 ms (98.80% GC)
  --------------
  samples:          6704
  evals/sample:     1

Another benchmark, this time for an array of strings:

julia> using Random

julia> @benchmark length(unique(x)) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial: 
  memory estimate:  450.50 KiB
  allocs estimate:  36
  --------------
  minimum time:     818.024 μs (0.00% GC)
  median time:      895.944 μs (0.00% GC)
  mean time:        906.568 μs (1.61% GC)
  maximum time:     1.964 ms (51.19% GC)
  --------------
  samples:          3049
  evals/sample:     1

julia> @benchmark num_unique(x) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial: 
  memory estimate:  144.68 KiB
  allocs estimate:  8
  --------------
  minimum time:     367.018 μs (0.00% GC)
  median time:      378.666 μs (0.00% GC)
  mean time:        384.486 μs (1.07% GC)
  maximum time:     1.314 ms (70.80% GC)
  --------------
  samples:          4527
  evals/sample:     1
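
A plausible reading of the lower memory estimates above is that the Set version only has to maintain the hash set, whereas unique also builds an output vector of the distinct values. The same counting can be written as an explicit loop, which also works for iterables that are not arrays. This is a sketch for illustration; num_unique_iter is a hypothetical name, not a Base function.

```julia
# Hypothetical helper: count distinct elements of any iterable
# by inserting them into a Set, without building an output vector.
function num_unique_iter(itr)
    seen = Set{eltype(itr)}()   # eltype may be Any for generators; that still works
    for v in itr
        push!(seen, v)          # pushing a duplicate leaves the set unchanged
    end
    return length(seen)
end

num_unique_iter([1, 2, 2, 3]) == length(Set([1, 2, 2, 3]))
```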
