ejg*_*ejg 1 compare group-by aggregate r
My data table has the following format:
id source
1 A
1 B
2 A
3 B
4 A
4 B
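I want to create a new column that groups by id and holds a value reflecting the corresponding source values (i.e. A, B, or both), where both is used if an id appears with both A and B.
I would like the output to look like this: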
id source source_group
1 A both
1 B both
2 A A
3 B B
4 A both
4 B both
Bonus points if you can make this general-purpose so that it handles other values of source, e.g. A, B, C, D, etc.
You can use ave():
df$source_group <- with(df, {
ave(as.character(source), id, FUN=function(x) if(length(x) > 1) "both" else x)
})
This gives:
df
# id source source_group
# 1 1 A both
# 2 1 B both
# 3 2 A A
# 4 3 B B
# 5 4 A both
# 6 4 B both
Or, as David suggested, we can use data.table:
library(data.table)
setDT(df)[, source_group := if(.N > 1) "both" else as.character(source), by = id]
This gives:
df
# id source source_group
# 1: 1 A both
# 2: 1 B both
# 3: 2 A A
# 4: 3 B B
# 5: 4 A both
# 6: 4 B both
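Note that both of these assume the source column is a factor, hence the as.character() calls; those calls are harmless if the column is already character. Both also label any id with more than one row as both, so they assume each id/source pair appears at most once.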
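The question also asks for something that generalizes beyond A and B. One possible sketch, which is not part of the answer above: label each id with the sorted set of its distinct source values instead of a fixed "both". The helper name label_group and the "+" separator are illustrative choices only.

```r
# Example data from the question
df <- data.frame(
  id = c(1, 1, 2, 3, 4, 4),
  source = c("A", "B", "A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse the distinct values seen in one group
label_group <- function(x) {
  u <- sort(unique(x))
  if (length(u) > 1) paste(u, collapse = "+") else u
}

# ave() applies the helper per id and recycles the single label over the group
df$source_group <- ave(as.character(df$source), df$id, FUN = label_group)
df
```

With this approach an id that has A, B and C would be labelled "A+B+C", and duplicate rows within an id do not change the label.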
FYI, here is an arguably more appropriate benchmark:
library(data.table)
library(dplyr)
library(microbenchmark)
# 1e5 ids, each with A and/or B chosen at random (ids that draw neither are dropped)
DT = data.table(id = seq(1e5))[,
  .(source = c(if (runif(1) > .5) "A", if (runif(1) > .5) "B")), by = id]
DF = data.frame(DT)
microbenchmark(
  dplyr =
    DF %>% group_by(id) %>% mutate(gr = if (n() > 1) "both" else as.character(source)),
  dplyr_dt =
    DT %>% group_by(id) %>% mutate(gr = if (n() > 1) "both" else as.character(source)),
  ave = DF$gr <-
    ave(as.character(DF$source), DF$id, FUN = function(x) if (length(x) > 1) "both" else x),
  dt = DT[, gr := if (.N > 1) "both" else as.character(source), by = id],
  dt2 = DT[,
    gr := as.character(source)][DT[, if (.N > 1) 1, by = id][, V1 := NULL],
    gr := "both", on = "id"],
  times = 10)
Results:
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 1200.13579 1215.56997 1328.73931 1245.81556 1252.66023 1828.02921 10
dplyr_dt 38.43108 41.58004 47.98858 43.89661 49.27464 68.64005 10
ave 149.67549 153.03421 167.09148 163.19261 181.60074 191.22481 10
dt 32.31500 33.60741 41.00644 35.80188 37.60350 65.76292 10
dt2 25.99567 26.44592 28.11141 28.19138 28.55474 31.42691 10
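I'm not sure why ave does so poorly here. Perhaps, as @bunk said, ave doesn't scale well with many groups. dplyr is slow on a data.frame, but faster (as advertised) when using the data.table backend.
For what it's worth, my data.table solution is a little different (maybe warranting a separate answer?):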
DT[,
gr := as.character(source)
][DT[, if (.N > 1) 1, by=id][, V1 := NULL],
gr := "both"
, on = "id"]
First it sets gr equal to source, then it overwrites gr with both for those groups that have two rows.
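To see the two steps separately, here is a sketch on the six-row example from the question. It finds the multi-row ids by filtering a count table rather than with the if (.N > 1) 1 idiom used above, and the intermediate name multi is only illustrative.

```r
library(data.table)

# Example data from the question
DT <- data.table(
  id = c(1L, 1L, 2L, 3L, 4L, 4L),
  source = c("A", "B", "A", "B", "A", "B")
)

# Step 1: start with gr equal to source for every row
DT[, gr := as.character(source)]

# Step 2: collect the ids that have more than one row ...
multi <- DT[, .N, by = id][N > 1, .(id)]

# ... and overwrite gr for the matching rows with an update join
DT[multi, gr := "both", on = "id"]
DT
```

Only the rows whose id appears in multi are touched by the join, so single-row groups keep the value assigned in step 1.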