Add a new column based on whether column values are unique (A, B) or shared (both), grouped by id

ejg*_*ejg 1 compare group-by aggregate r

My data table has the following format:

id    source
1     A
1     B
2     A
3     B
4     A
4     B

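I would like to create a new column that is grouped by id and whose value reflects the corresponding source values (i.e. A, B, or both), where both is used if an id corresponds to both A and B.

I would like the output to look like this: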

id    source    source_group
1     A         both
1     B         both
2     A         A
3     B         B
4     A         both
4     B         both

Bonus if you can make it general purpose so that it handles other values of source, e.g. A, B, C, D, etc.

Ric*_*ven 5

You can use ave()

df$source_group <- with(df, {
  ave(as.character(source), id, FUN=function(x) if(length(x) > 1) "both" else x)
})

This gives

df
#   id source source_group
# 1  1      A         both
# 2  1      B         both
# 3  2      A            A
# 4  3      B            B
# 5  4      A         both
# 6  4      B         both

Or, as David suggested, we can use data.table

library(data.table)
setDT(df)[, source_group := if(.N > 1) "both" else as.character(source), by = id]

This gives

df
#    id source source_group
# 1:  1      A         both
# 2:  1      B         both
# 3:  2      A            A
# 4:  3      B            B
# 5:  4      A         both
# 6:  4      B         both

Note that both of these assume that the source column is of class factor (hence the as.character() calls).
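Both snippets above decide on the number of rows per id, which is enough for the example data. As a minimal sketch, assuming an id could list the same source more than once (the id 5 rows below are hypothetical and not part of the question's data), the check can count distinct values instead:

```r
# Example data from the question, plus a hypothetical id 5 that repeats "A".
df <- data.frame(
  id = c(1, 1, 2, 3, 4, 4, 5, 5),
  source = factor(c("A", "B", "A", "B", "A", "B", "A", "A"))
)

# Label a group "both" only when it holds more than one distinct source value;
# otherwise keep the original value for every row of the group.
df$source_group <- ave(
  as.character(df$source), df$id,
  FUN = function(x) if (length(unique(x)) > 1) "both" else x
)

print(df)
```

With a row-count test, id 5 would be labelled both even though it only ever has source A.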

  • I would rather compare against the overall unique values in `df`, e.g. `library(data.table); setDT(df)[, source_group := if (setequal(unique(source), unique(df$source))) "both" else as.character(source), by = id]`, as a more general solution. (3 upvotes)
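The idea in the comment above can also be sketched in base R. This is only an illustration under assumptions: the three-value data and the label "all" are made up for the example, and an id is treated as shared only when it covers every source value that occurs in the whole table.

```r
# Hypothetical data with three source values (not from the question).
df <- data.frame(
  id = c(1, 1, 1, 2, 3, 3),
  source = factor(c("A", "B", "C", "A", "A", "B"))
)

# Every source value that occurs anywhere in the table.
all_sources <- unique(as.character(df$source))

# An id gets the shared label only if its values match the full set;
# otherwise each row keeps its own source value.
df$source_group <- ave(
  as.character(df$source), df$id,
  FUN = function(x) if (setequal(unique(x), all_sources)) "all" else x
)

print(df)
```

Note that id 3 has two of the three values and therefore keeps A and B; whether that is the desired behaviour depends on how "general purpose" is meant in the question.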

Fra*_*ank 5

FYI, here is an arguably more appropriate benchmark:

library(data.table)
library(dplyr)
library(microbenchmark)

DT = data.table(id=seq(1e5))[,
  .(source = c(if (runif(1) > .5) "A", if (runif(1) > .5)"B")), by=id]
DF = data.frame(DT)

microbenchmark(
dplyr = 
  DF %>% group_by(id) %>% mutate(gr = if(n()>1) "both" else as.character(source)),
dplyr_dt = 
  DT %>% group_by(id) %>% mutate(gr = if(n()>1) "both" else as.character(source)),
ave = DF$gr <- 
  ave(as.character(DF$source), DF$id, FUN = function(x) if(length(x) > 1) "both" else x),
dt  = DT[, gr := if (.N > 1) "both" else as.character(source), by=id],
dt2 = DT[, 
  gr := as.character(source)][ DT[, if (.N > 1) 1, by=id][, V1 := NULL], 
  gr := "both", on = "id"],
  times=10)

Results:

Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval
    dplyr 1200.13579 1215.56997 1328.73931 1245.81556 1252.66023 1828.02921    10
 dplyr_dt   38.43108   41.58004   47.98858   43.89661   49.27464   68.64005    10
      ave  149.67549  153.03421  167.09148  163.19261  181.60074  191.22481    10
       dt   32.31500   33.60741   41.00644   35.80188   37.60350   65.76292    10
      dt2   25.99567   26.44592   28.11141   28.19138   28.55474   31.42691    10
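
I don't know why ave does so much worse here. Perhaps, as @bunk said, ave simply does not scale well with many groups. dplyr is slow on a data.frame but faster (as advertised) when using a data.table backend.

For what it's worth, my data.table solution is a bit different (warrants a separate answer?):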

DT[,
  gr := as.character(source)
][DT[, if (.N > 1) 1, by=id][, V1 := NULL], 
  gr := "both"
, on = "id"]

It first sets gr equal to source, then overwrites it with "both" for the groups that have more than one row (here, two).
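The two steps can be seen in isolation on the question's small example. This is a sketch, not the exact code above: the intermediate table `multi` is a name introduced here for illustration, and it is built with a `.N` filter rather than the `if (.N > 1) 1` idiom.

```r
library(data.table)

# Example data from the question.
DT <- data.table(
  id = c(1, 1, 2, 3, 4, 4),
  source = c("A", "B", "A", "B", "A", "B")
)

# Step 1: start with gr equal to source for every row.
DT[, gr := as.character(source)]

# Step 2: find the ids that have more than one row ...
multi <- DT[, .N, by = id][N > 1, .(id)]

# ... and overwrite gr for those ids with an update join on id.
DT[multi, gr := "both", on = "id"]

print(DT)
```

The update join only touches the rows whose id appears in `multi`, so single-row groups keep the value assigned in step 1.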