How to reduce/filter data per group?

Question


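I have a dataset with several columns and a grouping variable. I want to reduce the dataset to at most max_n rows per level of the grouping variable. At the same time, I want to preserve the distribution of the other columns. By that I mean that, after filtering, I want to keep the lowest and highest values of a and b. That is why I use the setorderv function below.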

library(data.table)

set.seed(22)
n=20
max_n = 6
dt <- data.table("grp"=sample(c("a", "b", "c"), n, replace=T),
                 "a"=sample(1:10, n, replace=T),
                 "b"=sample(1:20, n, replace=T),
                 "id"=1:n)
setorderv(dt, c("grp", "a", "b"))
dt


My temporary solution, which is neither very elegant nor very data.table-like, looks like this:

dt_new <- data.table()
for (gr in unique(dt[["grp"]])) {
  tmp <- dt[grp == gr, ]
  n_tmp <- nrow(tmp)
  if (n_tmp > max_n) {
    tmp <- tmp[as.integer(seq(1, n_tmp, length.out=max_n)),]
  }
  dt_new <- rbindlist(list(dt_new, tmp))
}
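
As a small illustration of what the loop above selects, the sketch below prints the positions kept for a group of 8 rows when max_n is 6 (both numbers are chosen only for this example). The first and last positions are always included, so with the rows ordered by a within each group, the rows holding the minimum and maximum of a survive. The extremes of b are not guaranteed to survive, because b is only the secondary sort key.

```r
# Positions kept by the loop for one group (illustrative sizes)
max_n <- 6
n_tmp <- 8

# Evenly spaced positions from 1 to n_tmp, truncated to whole row numbers
idx <- as.integer(seq(1, n_tmp, length.out = max_n))
print(idx)

# First and last rows of the sorted group are always part of the selection
print(c(first_kept = idx[1] == 1, last_kept = idx[length(idx)] == n_tmp))
```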


Is there a more elegant way to do this? EDIT: I would like a data.table solution.

The current code is too bulky.

Answer 1

r2e*_*ans 5

To keep the min (of a and b), the max (ditto), and a total of max_n rows per group in a data.table:

dt[, minmax := a %in% range(a) | b %in% range(b), by = grp]
set.seed(42)
dt[, .SD[minmax | 1:.N %in% head(sample(which(!minmax)), max_n - sum(minmax)),], grp]
#        grp     a     b    id minmax
#     <char> <int> <int> <int> <lgcl>
#  1:      a     1    11    14   TRUE
#  2:      a     2     9    13  FALSE
#  3:      a     2    19    17   TRUE
#  4:      a     5     7     6   TRUE
#  5:      a     8    12    19  FALSE
#  6:      a     9    11     7   TRUE
#  7:      b     1    20     1   TRUE
#  8:      b     2     1    16   TRUE
#  9:      b     3    19     3  FALSE
# 10:      b     4     3    11  FALSE
# 11:      b     7    10    18  FALSE
# 12:      b     9    17    10   TRUE
# 13:      c     1    16    12   TRUE
# 14:      c     3    14    20  FALSE
# 15:      c     5    18     9  FALSE
# 16:      c     6    20     5   TRUE
# 17:      c     7    13     8   TRUE
dt[, minmax := NULL] # cleanup
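
One way to sanity-check a result like the one above is to compare the per-group extremes before and after thinning and to confirm the row cap. The sketch below reruns the answer's approach on the same values as its data block (written compactly) and stores the result in `res`. The names `res`, `extremes`, `ranges_kept` and `cap_respected` are introduced here for illustration only.

```r
library(data.table)

max_n <- 6
# Same values as the answer's data, written compactly
dt <- data.table(
  grp = rep(c("a", "b", "c"), times = c(7, 8, 5)),
  a   = c(1L, 2L, 2L, 5L, 8L, 8L, 9L, 1L, 2L, 3L, 4L, 6L, 7L, 8L, 9L,
          1L, 3L, 5L, 6L, 7L),
  b   = c(11L, 9L, 19L, 7L, 11L, 12L, 11L, 20L, 1L, 19L, 3L, 3L, 10L, 10L, 17L,
          16L, 14L, 18L, 20L, 13L),
  id  = c(14L, 13L, 17L, 6L, 2L, 19L, 7L, 1L, 16L, 3L, 11L, 15L, 18L, 4L, 10L,
          12L, 20L, 9L, 5L, 8L)
)

# The answer's approach, with the result kept in `res`
dt[, minmax := a %in% range(a) | b %in% range(b), by = grp]
set.seed(42)
res <- dt[, .SD[minmax | 1:.N %in% head(sample(which(!minmax)), max_n - sum(minmax)), ], grp]

# Hypothetical helper: per-group extremes of a and b
extremes <- function(x) {
  x[, .(a_min = min(a), a_max = max(a), b_min = min(b), b_max = max(b)), keyby = grp]
}

# TRUE if every group kept its min/max of a and b
ranges_kept <- isTRUE(all.equal(extremes(dt), extremes(res)))
# TRUE if no group has more than max_n rows
cap_respected <- res[, .N, by = grp][, all(N <= max_n)]
print(c(ranges_kept = ranges_kept, cap_respected = cap_respected))
```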


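Walk-through:

minmax is TRUE where a or b is the min/max within its group (min/max per variable, per group)
which(!minmax) returns the row indices of the remaining rows (where neither a nor b is a min/max)
sample(.) shuffles those remaining row indices, and head(., max_n - sum(minmax)) returns no more of them than are needed to end up with max_n rows
minmax | 1:.N %in% .. reduces the group to the min/max rows plus the sampled rows; in the special case where the rows that are not an a/b min/max number no more than max_n - sum(minmax), this guarantees that all rows of the group are returned (as for group c above)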
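
Two edge cases may be worth guarding against, depending on the data. First, base R's sample() treats a single number x as 1:x, so a group with exactly one non-min/max row can have the wrong index drawn. Second, head() with a negative count drops elements from the end rather than returning none, which matters when a group has more min/max rows than max_n. The following is a minimal sketch of a more defensive variant, not the answer's own code. The helper name `keep_rows` is hypothetical, and the data are the same values as the answer's data block.

```r
library(data.table)

# Hypothetical helper: logical keep-vector for one group.
# `minmax` flags rows that must be kept; at most `max_n` rows are kept in
# total unless the flagged rows alone already exceed that number.
keep_rows <- function(minmax, max_n) {
  rest   <- which(!minmax)
  budget <- max(0, max_n - sum(minmax))   # never negative
  take   <- min(budget, length(rest))
  # sample.int() on the length sidesteps sample()'s 1:x special case
  picked <- if (take > 0) rest[sample.int(length(rest), take)] else integer(0)
  minmax | seq_along(minmax) %in% picked
}

max_n <- 6
dt <- data.table(
  grp = rep(c("a", "b", "c"), times = c(7, 8, 5)),
  a   = c(1L, 2L, 2L, 5L, 8L, 8L, 9L, 1L, 2L, 3L, 4L, 6L, 7L, 8L, 9L,
          1L, 3L, 5L, 6L, 7L),
  b   = c(11L, 9L, 19L, 7L, 11L, 12L, 11L, 20L, 1L, 19L, 3L, 3L, 10L, 10L, 17L,
          16L, 14L, 18L, 20L, 13L),
  id  = c(14L, 13L, 17L, 6L, 2L, 19L, 7L, 1L, 16L, 3L, 11L, 15L, 18L, 4L, 10L,
          12L, 20L, 9L, 5L, 8L)
)

set.seed(42)
# .I gives row numbers in dt, so no temporary column is needed
rows <- dt[, .I[keep_rows(a %in% range(a) | b %in% range(b), max_n)], by = grp]$V1
res  <- dt[rows]
print(res[, .N, by = grp])
```

When the min/max rows alone outnumber max_n, this sketch keeps all of them and nothing else, so the cap is exceeded rather than an extreme being dropped. Whether that is the right trade-off depends on the use case.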

Data

dt <- data.table::as.data.table(structure(list(grp = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c"), a = c(1L, 2L, 2L, 5L, 8L, 8L, 9L, 1L, 2L, 3L, 4L, 6L, 7L, 8L, 9L, 1L, 3L, 5L, 6L, 7L), b = c(11L, 9L, 19L, 7L, 11L, 12L, 11L, 20L, 1L, 19L, 3L, 3L, 10L, 10L, 17L, 16L, 14L, 18L, 20L, 13L), id = c(14L, 13L, 17L, 6L, 2L, 19L, 7L, 1L, 16L, 3L, 11L, 15L, 18L, 4L, 10L, 12L, 20L, 9L, 5L, 8L)), row.names = c(NA, -20L), class = c("data.table", "data.frame")))
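
If a deterministic result is preferred over random sampling, the question's evenly spaced selection can also be written as a single grouped expression. This is a sketch of that idea, assuming the same max_n and the same values as the data above. It keeps the first and last row of each sorted group, so it preserves the extremes of a but, unlike the approach in this answer, does not guarantee the extremes of b.

```r
library(data.table)

max_n <- 6
dt <- data.table(
  grp = rep(c("a", "b", "c"), times = c(7, 8, 5)),
  a   = c(1L, 2L, 2L, 5L, 8L, 8L, 9L, 1L, 2L, 3L, 4L, 6L, 7L, 8L, 9L,
          1L, 3L, 5L, 6L, 7L),
  b   = c(11L, 9L, 19L, 7L, 11L, 12L, 11L, 20L, 1L, 19L, 3L, 3L, 10L, 10L, 17L,
          16L, 14L, 18L, 20L, 13L),
  id  = c(14L, 13L, 17L, 6L, 2L, 19L, 7L, 1L, 16L, 3L, 11L, 15L, 18L, 4L, 10L,
          12L, 20L, 9L, 5L, 8L)
)
setorderv(dt, c("grp", "a", "b"))

# For each group, take evenly spaced positions from the first to the last row;
# groups with at most max_n rows are returned unchanged.
rows <- dt[, .I[as.integer(seq(1, .N, length.out = min(.N, max_n)))], by = grp]$V1
dt_new <- dt[rows]
print(dt_new)
```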


Welcome to SO, EnFiFa! (2 upvotes)
