Julia: detect and remove duplicate rows from an array?

Fra*_*art 3 duplicates julia

What is the best way to detect and remove duplicate rows from an array in Julia?

x = Integer.(round.(10 .* rand(1000,4)))

# In R I would apply the duplicated function.
x = x[duplicated(x),:]
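For reference, Julia's Base has no built-in `duplicated`, but R's semantics (mark every row that already appeared earlier) can be sketched with a `Set` of rows; the function name `duplicated` and the row-as-vector hashing below are my own choices, not a standard API:

```julia
# Minimal sketch of R's duplicated() for matrix rows:
# returns a Bool vector, true for rows that duplicate an earlier row.
function duplicated(A::AbstractMatrix)
    seen = Set{Vector{eltype(A)}}()
    dups = falses(size(A, 1))
    for i in axes(A, 1)
        row = A[i, :]          # row i copied out as a Vector
        if row in seen
            dups[i] = true
        else
            push!(seen, row)
        end
    end
    return dups
end

x = [1 2; 3 4; 1 2; 5 6]
duplicated(x)            # Bool[0, 0, 1, 0]
x[.!duplicated(x), :]    # drop duplicates, like R's x[!duplicated(x), ]
```

Note that, as in R, removing duplicates requires negating the mask: `x[duplicated(x), :]` would keep only the duplicate rows.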

Gni*_*muc 5

unique is what you are looking for (though by itself it does not answer the detection part of the question):

julia> x = Integer.(round.(10 .* rand(1000,4)))
1000×4 Array{Int64,2}:
 7  3  10   1
 7  4   8   9
 7  7   3   0
 3  4   8   2
 ⋮
julia> unique(x, 1)
973×4 Array{Int64,2}:
 7  3  10   1
 7  4   8   9
 7  7   3   0
 3  4   8   2
 ⋮
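On Julia 0.7 and later the dimension is passed as a keyword argument instead, so the call above becomes `unique(x, dims=1)`:

```julia
# Julia ≥ 0.7 syntax: the dimension to deduplicate along is a keyword.
x = [1 2; 3 4; 1 2; 5 6]
unique(x, dims=1)   # keeps the first occurrence of each row
```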

As for the detection part, a dirty fix would be editing this line in Base's generated `unique` method so that it also returns the retained row indices:

@nref $N A d->d == dim ? sort!(uniquerows) : (indices(A, d))

to:

(@nref $N A d->d == dim ? sort!(uniquerows) : (indices(A, d))), uniquerows

Alternatively, you could define your own unique2 with the abovementioned change:

using Base.Cartesian
import Base.Prehashed

@generated function unique2(A::AbstractArray{T,N}, dim::Int) where {T,N}
......
end

julia> y, idx = unique2(x, 1)

julia> y
960×4 Array{Int64,2}:
  8   3   1   5
  8   3   1   6
  1   1   0   1
  8  10   1  10
  9   1   8   7
  ⋮

julia> setdiff(1:1000, idx)
40-element Array{Int64,1}:
  99
 120
 132
 140
 216
 227
  ⋮
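If you would rather not patch or copy Base internals, a plain single pass with a `Set` yields both the unique rows and the kept indices at once; the names `unique_rows` and `keep` below are my own, and this forgoes Base's prehashing optimizations:

```julia
# Sketch: deduplicate rows and report which row indices were kept,
# without touching Base internals.
function unique_rows(A::AbstractMatrix)
    seen = Set{Vector{eltype(A)}}()
    keep = Int[]                     # indices of first occurrences
    for i in axes(A, 1)
        row = A[i, :]
        if !(row in seen)
            push!(seen, row)
            push!(keep, i)
        end
    end
    return A[keep, :], keep
end

x = [1 2; 3 4; 1 2; 5 6]
y, idx = unique_rows(x)
setdiff(axes(x, 1), idx)   # indices of the rows dropped as duplicates
```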

The benchmark on my machine is:

x = rand(1:10,1000,4) # 48 dups
@btime unique2($x, 1); 
124.342 μs (31 allocations: 145.97 KiB)
@btime duplicated($x);
407.809 μs (9325 allocations: 394.78 KiB) 

x = rand(1:4,1000,4) # 751 dups
@btime unique2($x, 1);
66.062 μs (25 allocations: 50.30 KiB)
@btime duplicated($x);
222.337 μs (4851 allocations: 237.88 KiB)

The results show that the convoluted metaprogramming-plus-hashtable approach in Base benefits a lot from its lower memory allocation.