Binning data in R

use*_*669 5 r binning

I have a vector of about 4000 values. I just need to split it into 60 equal intervals, and then compute the median for each bin.

v <- 1:4000

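v is really just a vector. I have read about cut, but it seems to require that I specify the break points. I just want 60 equal intervals.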

Tho*_*mas 15

Use cut and tapply:

> tapply(v, cut(v, 60), median)
          (-3,67.7]          (67.7,134]           (134,201]           (201,268] 
               34.0               101.0               167.5               234.0 
          (268,334]           (334,401]           (401,468]           (468,534] 
              301.0               367.5               434.0               501.0 
          (534,601]           (601,668]           (668,734]           (734,801] 
              567.5               634.0               701.0               767.5 
          (801,867]           (867,934]         (934,1e+03]    (1e+03,1.07e+03] 
              834.0               901.0               967.5              1034.0 
(1.07e+03,1.13e+03]  (1.13e+03,1.2e+03]  (1.2e+03,1.27e+03] (1.27e+03,1.33e+03] 
             1101.0              1167.5              1234.0              1301.0 
 (1.33e+03,1.4e+03]  (1.4e+03,1.47e+03] (1.47e+03,1.53e+03]  (1.53e+03,1.6e+03] 
             1367.5              1434.0              1500.5              1567.0 
 (1.6e+03,1.67e+03] (1.67e+03,1.73e+03]  (1.73e+03,1.8e+03]  (1.8e+03,1.87e+03] 
             1634.0              1700.5              1767.0              1834.0 
(1.87e+03,1.93e+03]    (1.93e+03,2e+03]    (2e+03,2.07e+03] (2.07e+03,2.13e+03] 
             1900.5              1967.0              2034.0              2100.5 
 (2.13e+03,2.2e+03]  (2.2e+03,2.27e+03] (2.27e+03,2.33e+03]  (2.33e+03,2.4e+03] 
             2167.0              2234.0              2300.5              2367.0 
 (2.4e+03,2.47e+03] (2.47e+03,2.53e+03]  (2.53e+03,2.6e+03]  (2.6e+03,2.67e+03] 
             2434.0              2500.5              2567.0              2634.0 
(2.67e+03,2.73e+03]  (2.73e+03,2.8e+03]  (2.8e+03,2.87e+03] (2.87e+03,2.93e+03] 
             2700.5              2767.0              2833.5              2900.0 
   (2.93e+03,3e+03]    (3e+03,3.07e+03] (3.07e+03,3.13e+03]  (3.13e+03,3.2e+03] 
             2967.0              3033.5              3100.0              3167.0 
 (3.2e+03,3.27e+03] (3.27e+03,3.33e+03]  (3.33e+03,3.4e+03]  (3.4e+03,3.47e+03] 
             3233.5              3300.0              3367.0              3433.5 
(3.47e+03,3.53e+03]  (3.53e+03,3.6e+03]  (3.6e+03,3.67e+03] (3.67e+03,3.73e+03] 
             3500.0              3567.0              3633.5              3700.0 
 (3.73e+03,3.8e+03]  (3.8e+03,3.87e+03] (3.87e+03,3.93e+03]    (3.93e+03,4e+03] 
             3767.0              3833.5              3900.0              3967.0
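The first label above, (-3,67.7], starts below the smallest value because cut, when given a bin count instead of break points, pushes the outer limits out by 0.1% of the data range. If cleaner edges are wanted, the break points can be built explicitly. The following is only a sketch of that idea; the names n_bins, breaks, bins and medians are made up for illustration.

```r
v <- 1:4000
n_bins <- 60

# 61 evenly spaced edges from the minimum to the maximum of the data
breaks <- seq(min(v), max(v), length.out = n_bins + 1)

# include.lowest = TRUE keeps the minimum value in the first bin;
# labels = FALSE returns integer bin codes instead of interval labels
bins <- cut(v, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Median of the values falling in each bin
medians <- tapply(v, bins, median)
```

With labels = FALSE the result is named by bin number (1 to 60), which is usually easier to work with than the interval strings.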


MrF*_*ick 4

In the past I have used this function:

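evenbins <- function(x, bin.count = 10, order = TRUE) {
    # Base size of every bin, plus one extra element for the first
    # length(x) %% bin.count bins so the sizes sum to length(x)
    bin.size <- rep(length(x) %/% bin.count, bin.count)
    bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1, 0)
    # Bin numbers laid out in order: 1, 1, ..., 2, 2, ...
    bin <- rep(1:bin.count, bin.size)
    if (order) {
        # Give each value the bin of its rank; ties are broken at random
        bin <- bin[rank(x, ties.method = "random")]
    }
    return(factor(bin, levels = 1:bin.count, ordered = order))
}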
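To see what sizes the function produces, the arithmetic in its first two lines can be worked through by hand for 4000 values and 60 bins. This is just an illustrative sketch; n, k, base_size, extra and sizes are names made up here, not part of the function.

```r
n <- 4000   # number of values
k <- 60     # number of bins

base_size <- n %/% k   # integer division: 66
extra     <- n %% k    # remainder: 40

# The first `extra` bins get one additional element each
sizes <- base_size + (seq_len(k) <= extra)

table(sizes)   # how many bins have each size
sum(sizes)     # adds back up to n
```

So 40 bins hold 67 elements and the remaining 20 hold 66.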

Then I can run it with

v.bin <- evenbins(v, 60)

and check the sizes with

table(v.bin)

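and see that they all contain 66 or 67 elements. By default this orders the values, just as cut does, so each successive factor level holds larger values. If you want to bin them based on their original order, use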

v.bin <- evenbins(v, 60, order=F)

instead. This just splits the data up in the order in which it appears.
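Note that the two answers bin differently: cut makes intervals of equal width, while evenbins makes bins of (nearly) equal count. For v <- 1:4000 the results look alike, but on skewed data they diverge. The sketch below illustrates this on simulated data, using a rank-based shortcut for the equal-count bins (an assumption of this example, not the evenbins function itself); x, width_bins and count_bins are made-up names.

```r
set.seed(1)
x <- rexp(4000)   # skewed sample data, for illustration only
k <- 60

# Equal-width bins: every interval spans the same range of values
width_bins <- cut(x, k)

# Equal-count bins: the i-th smallest value goes to bin ceiling(i * k / n)
n <- length(x)
count_bins <- ceiling(rank(x, ties.method = "first") * k / n)

range(table(width_bins))   # counts vary widely, some bins may be empty
range(table(count_bins))   # counts differ by at most one

# Median per equal-count bin
medians <- tapply(x, count_bins, median)
```

Which one is appropriate depends on whether "equal intervals" means equal ranges of values or equal numbers of observations.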