R data.table - sample by group with different sampling proportions

Question


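I would like to efficiently draw a random sample by group from a data.table, but it should be possible to sample a different proportion from each group.

If I wanted to sample the same fraction sampling_fraction from each group, I could take inspiration from this question and a related answer and do something like: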

library(data.table)

DT = data.table(a = sample(1:2), b = sample(1:1000,20))

group_sampler <- function(data, group_col, sample_fraction){
  # this function samples sample_fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column(s) used to group by
  #   sample_fraction - a value between 0 and 1 indicating what % of each group should be sampled
  data[,.SD[sample(.N, ceiling(.N*sample_fraction))],by = eval(group_col)]
}

# what % of data should be sampled
sampling_fraction = 0.5

# perform the sampling
sampled_dt <- group_sampler(DT, 'a', sampling_fraction)


But what if I wanted to sample 10% from group 1 and 50% from group 2?

Answer 1

sin*_*dur 5

You could use .GRP, but to make sure each fraction is matched to the correct group, you might want to define group_col as a factor variable.

group_sampler <- function(data, group_col, sample_fractions) {
  # this function samples a group-specific fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column(s) used to group by
  #   sample_fractions - vector of values between 0 and 1, one per group, matched to the groups by position via .GRP
  stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
  data[, .SD[sample(.N, ceiling(.N*sample_fractions[.GRP]))], keyby = group_col]
}
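
As a rough check of the positional version above, the sketch below runs it on made-up data and counts the sampled rows per group. The example table, its group sizes, and the seed are assumptions added for illustration; the groups are deliberately laid out in sorted order so that position 1 of sample_fractions clearly belongs to group 1.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

# hypothetical example data: 100 rows in group 1, 40 rows in group 2
DT <- data.table(a = rep(1:2, times = c(100, 40)), b = seq_len(140))

group_sampler <- function(data, group_col, sample_fractions) {
  # one fraction per group is required
  stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
  # .GRP is the running group counter, used here as a positional index
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[.GRP]))], keyby = group_col]
}

# 10% of group 1, 50% of group 2
sampled <- group_sampler(DT, "a", c(0.1, 0.5))

# number of sampled rows per group
counts <- sampled[, .N, by = a]
print(counts)
```

With these group sizes, group 1 should contribute ceiling(100 * 0.1) = 10 rows and group 2 should contribute ceiling(40 * 0.5) = 20 rows.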


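Edit in response to chinsoon12's comment:

It would be safer (instead of relying on the correct order) to write the last line of the function as: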

data[, .SD[sample(.N, ceiling(.N*sample_fractions[[unlist(.BY)]]))], keyby = group_col]


Then you pass sample_fractions as a named vector, with names that match the values of the grouping column:

group_sampler(DT, 'a', sample_fractions= c(x = 0.1, y = 0.9))
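
One caveat with the name-based lookup: if the grouping column is numeric or integer, unlist(.BY) is a number, and [[ ]] then indexes by position rather than by name. A minimal sketch of one way around this, assuming a single grouping column, is to coerce the group value to character before the lookup. The function name group_sampler_named, the example table DT2, and the seed are hypothetical and not part of the original answer.

```r
library(data.table)

group_sampler_named <- function(data, group_col, sample_fractions) {
  # every group value must have a correspondingly named fraction
  groups <- as.character(unique(data[[group_col]]))
  stopifnot(!is.null(names(sample_fractions)),
            all(groups %in% names(sample_fractions)))
  # coerce the current group value to character so the lookup is always by name
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[[as.character(.BY[[1]])]]))],
       keyby = group_col]
}

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

# hypothetical example data: group "y" appears first, so sorted order and order of appearance differ
DT2 <- data.table(a = rep(c("y", "x"), times = c(30, 50)), b = seq_len(80))

res <- group_sampler_named(DT2, "a", c(x = 0.1, y = 0.9))

# number of sampled rows per group
print(res[, .N, by = a])
```

Here group x should yield ceiling(50 * 0.1) = 5 rows and group y should yield ceiling(30 * 0.9) = 27 rows, regardless of the order in which the groups appear or the fractions are listed.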

