Group/bin/bucket data in R and get the count per bucket and the sum of values per bucket

Fre*_*ill 9 aggregate r binning

I would like to group/bin this data:

C1             C2       C3
49488.01172    0.0512   54000
268221.1563    0.0128   34399
34775.96094    0.0128   54444
13046.98047    0.07241  61000
2121699.75     0.00453  78921
71155.09375    0.0181   13794
1369809.875    0.00453  12312
750            0.2048   43451
44943.82813    0.0362   49871
85585.04688    0.0362   18947
31090.10938    0.0362   13401
68550.40625    0.0181   14345

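I want to bucket by the C2 values, but I want to define the buckets myself, e.g. <= 0.005, <= 0.010, <= 0.014, etc. As you can see, the bucket intervals will be uneven. I want the count of C1 per bucket as well as the sum of C1 per bucket.

I don't know where to start, as I am a fairly new R user. Would anyone be willing to help me figure out the code, or point me to an example that would fit my needs?

Edit: added another column, C3. I need the sum of C3 per bucket, as well as the sum and count of C1 per bucket.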

akr*_*run 12

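From the comments, "C2" appears to be a "character" column with a % suffix. Before creating the group, remove the % using sub and convert the result to "numeric" (as.numeric). The variable "group" is created (transform(df, ...)) with the cut function, using its breaks (the bucket boundaries/intervals) and labels (the desired group labels) arguments. Once the group variable exists, the sum of "C1" per "group" and the "count" of elements in each "group" can be computed with aggregate from "base R".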

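# Strip the "%" suffix, convert to numeric, then bin C2 into the custom buckets
df1 <- transform(df, group=cut(as.numeric(sub('[%]', '', C2)),
    breaks=c(-Inf, 0.005, 0.010, 0.014, Inf),
    labels=c('<0.005', 0.005, 0.01, 0.014)))

# Count and sum of C1 per bucket; do.call(data.frame, ...) flattens the matrix column
res <- do.call(data.frame, aggregate(C1~group, df1,
    FUN=function(x) c(Count=length(x), Sum=sum(x))))

# aggregate() drops empty buckets; merging with all levels shows them as NA
dNew <- data.frame(group=levels(df1$group))
merge(res, dNew, all=TRUE)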
 #   group C1.Count    C1.Sum
 #1 <0.005        2 3491509.6
 #2  0.005       NA        NA
 #3   0.01        2  302997.1
 #4  0.014        8  364609.5
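To check which bucket each value lands in, it can help to run cut on its own first. The sketch below uses a few of the C2 values from the question (with the % already stripped) plus 0.005 itself as a boundary case; the descriptive labels are my own choice for illustration, not the ones used in the answer. By default cut uses right-closed intervals, so a value equal to a break falls into the lower bucket.

```r
# A few C2 values from the question, plus 0.005 itself to show the boundary case
x <- c(0.00453, 0.005, 0.0128, 0.0181, 0.2048)

# Each break is the upper edge of a bucket; intervals are right-closed by default.
# The labels are illustrative only.
bucket <- cut(x,
              breaks = c(-Inf, 0.005, 0.010, 0.014, Inf),
              labels = c("<=0.005", "(0.005,0.010]", "(0.010,0.014]", ">0.014"))

# Show each value next to its assigned bucket
data.frame(x, bucket)

# table() reports every level, including empty buckets
table(bucket)
```

Unlike aggregate, table keeps the empty (0.005,0.010] bucket, which is why the answer merges against the full set of levels.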

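Or you could use data.table. setDT converts the data.frame to a data.table. Specify the "grouping" variable in by= and create the two summary variables "Count" and "Sum" inside list(. .N gives the number of elements in each "group".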

 library(data.table)
  setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]
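For the edited question, which also asks for the sum of C3, one possible data.table variant is sketched below. It assumes the three-column table from the question with C2 already numeric, and it uses .SD with .SDcols to sum several columns at once; this is my own extension of the approach above rather than something shown in the answer.

```r
library(data.table)

# Data from the question, with C2 as plain numbers
dt <- data.table(
  C1 = c(49488.01172, 268221.1563, 34775.96094, 13046.98047, 2121699.75,
         71155.09375, 1369809.875, 750, 44943.82813, 85585.04688,
         31090.10938, 68550.40625),
  C2 = c(0.0512, 0.0128, 0.0128, 0.07241, 0.00453, 0.0181, 0.00453,
         0.2048, 0.0362, 0.0362, 0.0362, 0.0181),
  C3 = c(54000, 34399, 54444, 61000, 78921, 13794, 12312, 43451,
         49871, 18947, 13401, 14345)
)

# Add the bucket column by reference, using the same breaks and labels as above
dt[, group := cut(C2, breaks = c(-Inf, 0.005, 0.010, 0.014, Inf),
                  labels = c('<0.005', 0.005, 0.01, 0.014))]

# One count per bucket, plus the sum of every column named in .SDcols
res <- dt[, c(list(Count = .N), lapply(.SD, sum)),
          by = group, .SDcols = c("C1", "C3")]
res
```

The summed columns keep their original names (C1, C3) in the result; rename them with setnames if a Sum suffix is preferred.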

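Or use dplyr. The %>% operator pipes the LHS into the RHS as an argument, chaining the steps together. Use group_by to specify the "group" variable, then use summarise_each or summarise to get the count and sum of the column of interest. summarise_each is useful when there are several columns.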

 library(dplyr)
 df1 %>%
      group_by(group) %>% 
      summarise_each(funs(n(), Sum=sum(.)), C1)

Update

Using the new dataset df

df1 <- transform(df, group=cut(C2,  breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
                             labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1, 
       FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
#  group C1.Count    C1.Sum C3.Count C3.Sum
#1 <0.005        2 3491509.6        2  91233
#2   0.01        2  302997.1        2  88843
#3  0.014        8  364609.5        8 268809

You can then do the merge as detailed above.
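Spelled out for the updated dataset, that merge step might look like the sketch below. It rebuilds the data from the table in the question (C2 numeric, C3 included) so that it runs on its own, and otherwise reuses the calls from the answer.

```r
# Data from the question, with C2 as plain numbers
df <- data.frame(
  C1 = c(49488.01172, 268221.1563, 34775.96094, 13046.98047, 2121699.75,
         71155.09375, 1369809.875, 750, 44943.82813, 85585.04688,
         31090.10938, 68550.40625),
  C2 = c(0.0512, 0.0128, 0.0128, 0.07241, 0.00453, 0.0181, 0.00453,
         0.2048, 0.0362, 0.0362, 0.0362, 0.0181),
  C3 = c(54000, 34399, 54444, 61000, 78921, 13794, 12312, 43451,
         49871, 18947, 13401, 14345)
)

# Bin C2 into the custom buckets
df1 <- transform(df, group = cut(C2, breaks = c(-Inf, 0.005, 0.010, 0.014, Inf),
                                 labels = c('<0.005', 0.005, 0.01, 0.014)))

# Count and sum of C1 and C3 per bucket
res <- do.call(data.frame, aggregate(cbind(C1, C3) ~ group, df1,
               FUN = function(x) c(Count = length(x), Sum = sum(x))))

# Merge against every level so the empty bucket appears with NA values
dNew <- data.frame(group = levels(df1$group))
out <- merge(res, dNew, all = TRUE)
out
```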

The dplyr approach is the same, except that the additional variable is specified.

 df1%>%
      group_by(group) %>%
       summarise_each(funs(n(), Sum=sum(.)), C1, C3)
 #Source: local data frame [3 x 5]

 #  group C1_n C3_n    C1_Sum C3_Sum
 #1 <0.005    2    2 3491509.6  91233
 #2   0.01    2    2  302997.1  88843
 #3  0.014    8    8  364609.5 268809
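Note that summarise_each and funs have been deprecated in later dplyr releases. Assuming dplyr 1.0.0 or newer, roughly the same summary could be written with across; the sketch below is a suggested equivalent rather than part of the original answer, and it rebuilds the question's data (C2 numeric) so that it runs on its own.

```r
library(dplyr)

# Data from the question, with C2 as plain numbers
df <- data.frame(
  C1 = c(49488.01172, 268221.1563, 34775.96094, 13046.98047, 2121699.75,
         71155.09375, 1369809.875, 750, 44943.82813, 85585.04688,
         31090.10938, 68550.40625),
  C2 = c(0.0512, 0.0128, 0.0128, 0.07241, 0.00453, 0.0181, 0.00453,
         0.2048, 0.0362, 0.0362, 0.0362, 0.0181),
  C3 = c(54000, 34399, 54444, 61000, 78921, 13794, 12312, 43451,
         49871, 18947, 13401, 14345)
)

res <- df %>%
  # Bin C2 into the custom buckets
  mutate(group = cut(C2, breaks = c(-Inf, 0.005, 0.010, 0.014, Inf),
                     labels = c('<0.005', 0.005, 0.01, 0.014))) %>%
  group_by(group) %>%
  # Apply both summaries to each listed column; names become C1_n, C1_Sum, ...
  summarise(across(c(C1, C3), list(n = length, Sum = sum)))
res
```

The result columns come out grouped per variable (C1_n, C1_Sum, C3_n, C3_Sum) rather than per function as in the output above.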

data

df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094, 
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813, 
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%", 
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%", 
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1", 
"C2"), row.names = c(NA, -12L), class = "data.frame")