Is there a better equivalent of pandas value_counts for Julia DataFrames?

Question


I am looking for an equivalent of the very convenient pandas value_counts for a series (a single column) of a DataFrame in Julia.

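Unfortunately, I could not find anything here, so my own solution for value_counts on a Julia DataFrame is shown below. However, I don't much like it, because it is less convenient than pandas, where you simply call the .value_counts() method. So my question is: are there other (more convenient) options?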

using DataFrames
jdf = DataFrame(rand(Int8, (1000000, 3)), :auto)  # :auto names the columns x1, x2, x3 (required since DataFrames.jl 1.0)


This gives me:

│ Row     │ x1   │ x2   │ x3   │
│         │ Int8 │ Int8 │ Int8 │
├─────────┼──────┼──────┼──────┤
│ 1       │ -97  │ 98   │ 79   │
│ 2       │ -77  │ -118 │ -19  │
⋮
│ 999998  │ -115 │ 17   │ 107  │
│ 999999  │ -43  │ -64  │ 72   │
│ 1000000 │ 40   │ -11  │ 31   │


The value counts for the first column are then:

combine(nrow,groupby(jdf,:x1))


which returns:

│ Row │ x1   │ nrow  │
│     │ Int8 │ Int64 │
├─────┼──────┼───────┤
│ 1   │ -97  │ 3942  │
│ 2   │ -77  │ 3986  │
⋮
│ 254 │ 12   │ 3899  │
│ 255 │ -92  │ 3973  │
│ 256 │ -49  │ 3952  │


Answer 1

Bog*_*ski 6

In DataFrames.jl, this is the way to get the result you want. In general, the approach in DataFrames.jl is to keep the API minimal. If you use combine(nrow,groupby(jdf,:x1)) often, you can define:

value_counts(df, col) = combine(groupby(df, col), nrow)


in your script.
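If the pandas-style ordering matters (most frequent values first), the helper can also sort its result. The following is only a minimal sketch, assuming DataFrames.jl 1.x; the function name `value_counts_sorted` and the `:count` column name are illustrative choices and not part of DataFrames.jl:

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): count the rows per distinct
# value of `col` and list the most frequent values first, as pandas does.
function value_counts_sorted(df::AbstractDataFrame, col)
    counts = combine(groupby(df, col), nrow => :count)
    return sort(counts, :count, rev=true)
end

# Small example frame so the result is easy to check by eye.
df = DataFrame(x1 = [1, 2, 2, 3, 3, 3])
vc = value_counts_sorted(df, :x1)
```

Here `nrow => :count` only renames the result column; dropping the `sort` call gives the same output as the one-line definition above.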

Alternative ways to achieve what you want are to use FreqTables.jl or StatsBase.jl:

julia> using FreqTables, StatsBase

julia> freqtable(jdf, :x1)
256-element Named Array{Int64,1}
x1   │
─────┼─────
-128 │ 3875
-127 │ 3931
-126 │ 3924
⋮         ⋮
125  │ 3873
126  │ 3917
127  │ 3975

julia> countmap(jdf.x1)
Dict{Int8,Int64} with 256 entries:
  -98  => 3925
  -74  => 4054
  11   => 3798
  -56  => 3853
  29   => 3765
  -105 => 3918
  ⋮    => ⋮


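(the difference is that the output type varies: combine returns a DataFrame, freqtable a NamedArray, and countmap a Dict)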
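If the speed of countmap is attractive but a DataFrame is needed downstream, the Dict can be converted afterwards. This is a sketch only, assuming StatsBase.jl and DataFrames.jl are installed; the column names `value` and `count` are arbitrary:

```julia
using DataFrames, StatsBase

x = [1, 2, 2, 3, 3, 3]
cm = countmap(x)    # Dict mapping each distinct value to its count

# keys(cm) and values(cm) iterate in the same (unspecified) order,
# so build the columns from them and then sort explicitly.
counts = DataFrame(value = collect(keys(cm)), count = collect(values(cm)))
sort!(counts, :value)
```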

In terms of performance, countmap is the fastest and combine is the slowest:

julia> using BenchmarkTools

julia> @benchmark countmap($jdf.x1)
BenchmarkTools.Trial:
  memory estimate:  16.80 KiB
  allocs estimate:  14
  --------------
  minimum time:     436.000 μs (0.00% GC)
  median time:      443.200 μs (0.00% GC)
  mean time:        455.244 μs (0.22% GC)
  maximum time:     5.362 ms (91.59% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark freqtable($jdf, :x1)
BenchmarkTools.Trial:
  memory estimate:  37.22 KiB
  allocs estimate:  86
  --------------
  minimum time:     7.972 ms (0.00% GC)
  median time:      8.089 ms (0.00% GC)
  mean time:        8.158 ms (0.00% GC)
  maximum time:     10.016 ms (0.00% GC)
  --------------
  samples:          613
  evals/sample:     1

julia> @benchmark combine(groupby($jdf,:x1), nrow)
BenchmarkTools.Trial:
  memory estimate:  23.28 MiB
  allocs estimate:  183
  --------------
  minimum time:     12.679 ms (0.00% GC)
  median time:      14.572 ms (8.68% GC)
  mean time:        15.239 ms (14.50% GC)
  maximum time:     20.385 ms (21.83% GC)
  --------------
  samples:          328
  evals/sample:     1


Note that most of the cost of combine is the grouping, so if you have already created the GroupedDataFrame object, combine is relatively fast:

julia> gdf = groupby(jdf,:x1);

julia> @benchmark combine($gdf, nrow)
BenchmarkTools.Trial:
  memory estimate:  16.16 KiB
  allocs estimate:  152
  --------------
  minimum time:     680.801 μs (0.00% GC)
  median time:      714.800 μs (0.00% GC)
  mean time:        737.568 μs (0.15% GC)
  maximum time:     4.561 ms (83.47% GC)
  --------------
  samples:          6766
  evals/sample:     1

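In practice this means it can pay to group once and reuse the GroupedDataFrame for every summary you need. A small sketch, assuming DataFrames.jl 1.x; the example data and the output column names are made up for illustration:

```julia
using DataFrames

df = DataFrame(x1 = [1, 2, 2, 3, 3, 3], x2 = [10, 20, 30, 40, 50, 60])

# Pay the grouping cost once ...
gdf = groupby(df, :x1)

# ... then reuse the same GroupedDataFrame for several summaries.
counts = combine(gdf, nrow => :count)
sums   = combine(gdf, :x2 => sum => :x2_sum)
```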

Edit

If you want a sorted dictionary, load DataStructures.jl and then run:

sort!(OrderedDict(countmap(jdf.x1)))


or

 sort!(OrderedDict(countmap(jdf.x1)), byvalue=true)


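depending on whether you want the dictionary sorted by key (the first form) or by count (the second form, with byvalue=true).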
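If a sorted vector of value => count pairs is enough, Base's `sort` on the collected dictionary avoids the extra DataStructures.jl dependency. A minimal sketch, assuming only StatsBase.jl; the variable names are illustrative:

```julia
using StatsBase

x = [1, 2, 2, 3, 3, 3]
cm = countmap(x)

# collect(cm) yields a Vector of value => count pairs that Base can sort.
by_key   = sort(collect(cm), by = first)             # ascending by value
by_count = sort(collect(cm), by = last, rev = true)  # most frequent first
```

Unlike the OrderedDict variants, the result no longer supports lookup by key, so this fits best when the counts are only printed or iterated over.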
