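What is the function in Julia that returns the number of unique elements in an array?
In R you have length(unique(x)). I can do the same in Julia, but I think there should be a more efficient way.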
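If you want an exact answer, length(unique(x)) is about as efficient as it gets for general objects. If your values have a limited domain, e.g. UInt8, using a fixed-size table may be more efficient. If you can accept an approximation, you can use the HyperLogLog data structure/algorithm, which is implemented in the OnlineStats package: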
https://joshday.github.io/OnlineStats.jl/latest/api/#OnlineStats.HyperLogLog
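For the limited-domain case, here is a minimal sketch of what a fixed-size table could look like for UInt8 values. The function name count_unique_uint8 is hypothetical (it is not part of Base or any package), and the sketch assumes the input is a vector of UInt8; it has not been benchmarked against the alternatives.

```julia
# Count distinct values in a UInt8 vector using a fixed 256-entry table.
# `count_unique_uint8` is a hypothetical helper name, not a Base function.
function count_unique_uint8(x::AbstractVector{UInt8})
    seen = falses(256)      # one flag per possible UInt8 value (0..255)
    n = 0
    for v in x
        i = Int(v) + 1      # shift the value 0..255 to a 1-based index
        if !seen[i]
            seen[i] = true
            n += 1
        end
    end
    return n
end

# Usage: the result should agree with the general-purpose approach.
x = rand(UInt8, 10_000)
count_unique_uint8(x) == length(unique(x))
```

Because the table has a fixed size, there is no hashing and only one small allocation, which is why this may pay off when the domain is this small.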
It looks like length(Set(x)) is somewhat faster than length(unique(x)).
julia> using StatsBase, BenchmarkTools
julia> num_unique(x) = length(Set(x));
julia> a = sample(1:100, 200);
julia> num_unique(a) == length(unique(a))
true
julia> @benchmark length(unique(x)) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial:
memory estimate: 450.50 KiB
allocs estimate: 36
--------------
minimum time: 498.130 μs (0.00% GC)
median time: 570.588 μs (0.00% GC)
mean time: 579.011 μs (2.41% GC)
maximum time: 2.321 ms (63.03% GC)
--------------
samples: 5264
evals/sample: 1
julia> @benchmark num_unique(x) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial:
memory estimate: 288.68 KiB
allocs estimate: 8
--------------
minimum time: 283.031 μs (0.00% GC)
median time: 393.317 μs (0.00% GC)
mean time: 397.878 μs (4.24% GC)
maximum time: 33.499 ms (98.80% GC)
--------------
samples: 6704
evals/sample: 1
Another benchmark, for an array of strings:
julia> using Random
julia> @benchmark length(unique(x)) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial:
memory estimate: 450.50 KiB
allocs estimate: 36
--------------
minimum time: 818.024 μs (0.00% GC)
median time: 895.944 μs (0.00% GC)
mean time: 906.568 μs (1.61% GC)
maximum time: 1.964 ms (51.19% GC)
--------------
samples: 3049
evals/sample: 1
julia> @benchmark num_unique(x) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial:
memory estimate: 144.68 KiB
allocs estimate: 8
--------------
minimum time: 367.018 μs (0.00% GC)
median time: 378.666 μs (0.00% GC)
mean time: 384.486 μs (1.07% GC)
maximum time: 1.314 ms (70.80% GC)
--------------
samples: 4527
evals/sample: 1