groupby() with two arrays in Julia?

Question

I have two arrays of the same length:

a1 = [1,1,3,4,6,6]
a2 = [1,2,3,4,5,6]

I want to group them by the values in a1 and get the mean of a2 within each group. The output, computed from a2, should look like this:

result:
1.5
3.0
4.0
5.5

Please suggest a way to accomplish this. Thanks!

Answer 1

Bog*_*ski 5

Here is a solution using DataFrames.jl:

julia> using DataFrames, Statistics

julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);

julia> combine(groupby(df, :a1), :a2 => mean)
4×2 DataFrame
 Row │ a1     a2_mean
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.5
   2 │     3      3.0
   3 │     4      4.0
   4 │     6      5.5
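
If you would rather not load DataFrames.jl for a small case like this, the same grouping can be sketched in plain Julia. This is only an illustration: `group_means` is a hypothetical helper name that does not come from the answer, and returning the groups in sorted key order is an assumption chosen to match the output shown in the question.

```julia
using Statistics  # standard library; provides mean

a1 = [1, 1, 3, 4, 6, 6]
a2 = [1, 2, 3, 4, 5, 6]

# Hypothetical helper (not part of the original answer): mean of `values`
# within each group defined by `keys`, with groups in sorted key order.
function group_means(keys, values)
    length(keys) == length(values) ||
        throw(ArgumentError("arrays must have the same length"))
    groups = sort(unique(keys))
    # For each distinct key, select the matching values with a boolean mask.
    return [mean(values[keys .== g]) for g in groups]
end

result = group_means(a1, a2)
println(result)  # [1.5, 3.0, 4.0, 5.5]
```

This scans the whole array once per group, so it is fine for small inputs but is likely to be much slower than the DataFrames.jl approach when there are many groups.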

Edit:

Here are the timings (in Julia, keep in mind that a function has to be compiled the first time it is run, which takes time):

julia> using DataFrames, Statistics

(@v1.6) pkg> st DataFrames # I am using main branch, as it should be released this week
      Status `D:\.julia\environments\v1.6\Project.toml`
  [a93c6f00] DataFrames v0.22.7 `https://github.com/JuliaData/DataFrames.jl.git#main`

julia> df = DataFrame(a1=rand(1:1000, 10^8), a2=rand(10^8)); # 10^8 rows in 1000 random groups

julia> @time combine(groupby(df, :a1), :a2 => mean); # first run includes compilation time
  3.781717 seconds (6.76 M allocations: 1.151 GiB, 6.73% gc time, 84.20% compilation time)

julia> @time combine(groupby(df, :a1), :a2 => mean); # second run is just execution time
  0.442082 seconds (294 allocations: 762.990 MiB)
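
The first-run versus second-run difference can also be seen on a tiny example without DataFrames.jl. The sketch below is only an illustration: `sumsq` is a hypothetical function made up for this purpose, and the actual numbers will depend on your machine and Julia version.

```julia
# Hypothetical function, not from the original answer: sum of squares of a vector.
sumsq(v) = sum(x -> x^2, v)

v = rand(10^6)

# The first call compiles sumsq for Vector{Float64}, so the measured
# time includes compilation.
t_first = @elapsed sumsq(v)

# The second call reuses the compiled code, so this is execution time only.
t_second = @elapsed sumsq(v)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

For serious measurements it is usually better to time repeated runs rather than a single call, since one run can be affected by garbage collection and other noise.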

Note that, for example, data.table (if that is your point of reference) is noticeably slower on similar data:

> library(data.table) # using 4 threads
> df = data.table(a1 = sample(1:1000, 10^8, replace=T), a2 = runif(10^8));
> system.time(df[, .(mean(a2)), by = a1])
   user  system elapsed 
   4.72    1.20    2.00

Archived:	4 years, 7 months ago
Views:	160
Last recorded:	4 years, 7 months ago