Julia: How to return the number of unique elements in an array

xia*_*dai 6 julia

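What function in Julia returns the number of unique elements in an array?

In R, you have length(unique(x)). I can do the same thing in Julia, but I think there should be a more efficient way.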

Ste*_*ski 6

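If you want an exact answer, length(unique(x)) is already quite efficient for general objects. If your values have a limited domain, e.g. UInt8, using a fixed-size table may be more efficient. If you can accept an approximation, you can use the HyperLogLog data structure/algorithm, which is implemented in the OnlineStats package: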

https://joshday.github.io/OnlineStats.jl/latest/api/#OnlineStats.HyperLogLog
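The fixed-size table idea for a limited domain such as UInt8 can be sketched roughly as follows. This is only an illustration: `count_unique_uint8` is a hypothetical helper name, not a Base or package function, and no claim is made here about how much faster it is than length(unique(x)).

```julia
# Hypothetical helper (not part of Base): count distinct values in a UInt8
# vector using a fixed 256-slot lookup table instead of a hash set.
function count_unique_uint8(x::AbstractVector{UInt8})
    seen = falses(256)          # one flag per possible UInt8 value
    n = 0
    for v in x
        i = Int(v) + 1          # shift 0..255 to 1-based indices 1..256
        if !seen[i]
            seen[i] = true
            n += 1
        end
    end
    return n
end

bytes = UInt8[0x01, 0x02, 0x02, 0xff, 0x01]
count_unique_uint8(bytes) == length(unique(bytes))  # both sides count 3 distinct values
```

The table needs one slot per possible value, so the same approach only makes sense when the domain is small and known in advance.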


Cam*_*nek 6

It seems that length(Set(x)) is somewhat faster than length(unique(x)):

julia> using StatsBase, BenchmarkTools

julia> num_unique(x) = length(Set(x));

julia> a = sample(1:100, 200);

julia> num_unique(a) == length(unique(a))
true

julia> @benchmark length(unique(x)) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial: 
  memory estimate:  450.50 KiB
  allocs estimate:  36
  --------------
  minimum time:     498.130 μs (0.00% GC)
  median time:      570.588 μs (0.00% GC)
  mean time:        579.011 μs (2.41% GC)
  maximum time:     2.321 ms (63.03% GC)
  --------------
  samples:          5264
  evals/sample:     1

julia> @benchmark num_unique(x) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial: 
  memory estimate:  288.68 KiB
  allocs estimate:  8
  --------------
  minimum time:     283.031 μs (0.00% GC)
  median time:      393.317 μs (0.00% GC)
  mean time:        397.878 μs (4.24% GC)
  maximum time:     33.499 ms (98.80% GC)
  --------------
  samples:          6704
  evals/sample:     1

Another benchmark, this time on an array of strings:

julia> using Random

julia> @benchmark length(unique(x)) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial: 
  memory estimate:  450.50 KiB
  allocs estimate:  36
  --------------
  minimum time:     818.024 μs (0.00% GC)
  median time:      895.944 μs (0.00% GC)
  mean time:        906.568 μs (1.61% GC)
  maximum time:     1.964 ms (51.19% GC)
  --------------
  samples:          3049
  evals/sample:     1

julia> @benchmark num_unique(x) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial: 
  memory estimate:  144.68 KiB
  allocs estimate:  8
  --------------
  minimum time:     367.018 μs (0.00% GC)
  median time:      378.666 μs (0.00% GC)
  mean time:        384.486 μs (1.07% GC)
  maximum time:     1.314 ms (70.80% GC)
  --------------
  samples:          4527
  evals/sample:     1
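
One plausible reason for the lower memory use of the Set version is that an order-preserving unique has to keep an output vector in addition to tracking which elements it has already seen, while length(Set(x)) only builds the set. The sketch below illustrates that extra work. `unique_sketch` is a hypothetical, simplified stand-in and not Base's actual implementation; `num_unique` is the same definition used in the benchmarks.

```julia
# Simplified illustration of what an order-preserving unique has to do.
# This is NOT Base's actual implementation.
function unique_sketch(x)
    seen = Set{eltype(x)}()     # membership tracking
    out = eltype(x)[]           # extra output vector that length(Set(x)) never builds
    for v in x
        if !(v in seen)
            push!(seen, v)
            push!(out, v)
        end
    end
    return out
end

# Same definition as in the benchmarks
num_unique(x) = length(Set(x))

words = ["ab", "cd", "ab", "ef"]
length(unique_sketch(words)) == num_unique(words) == length(unique(words))
```

If the unique elements themselves are needed in their original order, unique(x) remains the right tool; the Set version only helps when the count is all that matters.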