Stratified sampling by factor

Question

I have a dataset of 1,000 rows with the following structure:

     device geslacht leeftijd type1 type2
1       mob        0       53     C     3
2       tab        1       64     G     7
3        pc        1       50     G     7
4       tab        0       75     C     3
5       mob        1       54     G     7
6        pc        1       58     H     8
7        pc        1       57     A     1
8        pc        0       68     E     5
9        pc        0       66     G     7
10      mob        0       45     C     3
11      tab        1       77     E     5
12      mob        1       16     A     1

I want to draw a sample of 80 rows, made up of 10 rows with type1 = A, 10 rows with type1 = B, and so on. Can anyone help me with this?

Answer 1

Dav*_*urg 11

Here is how I would do it using data.table:

library(data.table)
indx <- setDT(df)[, .I[sample(.N, 10, replace = TRUE)], by = type1]$V1
df[indx]
#     device geslacht leeftijd type1 type2
#  1:    mob        0       45     C     3
#  2:    mob        0       53     C     3
#  3:    tab        0       75     C     3
#  4:    mob        0       53     C     3
#  5:    tab        0       75     C     3
#  6:    mob        0       45     C     3
#  7:    tab        0       75     C     3
#  8:    mob        0       53     C     3
#  9:    mob        0       53     C     3
# 10:    mob        0       53     C     3
# 11:    mob        1       54     G     7
#...

Or a simpler version:

setDT(df)[, .SD[sample(.N, 10, replace = TRUE)], by = type1]

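Basically, we sample from the row indices within each type1 group (with replacement, because the example data has fewer than 10 rows per group) and then subset the data by that index.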

Similarly, with dplyr you could do:

library(dplyr)
df %>% 
  group_by(type1) %>%
  sample_n(10, replace = TRUE)
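
The examples above sample with replacement only because the excerpt has fewer than 10 rows per type1 level. In the full 1,000-row dataset each level probably has more than 10 rows, in which case sampling without replacement avoids duplicated rows. A minimal base R sketch of that idea, assuming a hypothetical helper `sample_rows_per_group()` (not part of any package) and made-up toy data; it falls back to replacement only for groups smaller than `n`:

```r
# Hypothetical helper: draw n rows per group, without replacement when possible.
sample_rows_per_group <- function(data, group_col, n) {
  # Row indices of `data`, split by the values of the grouping column
  idx_by_group <- split(seq_len(nrow(data)), data[[group_col]], drop = TRUE)
  picked <- lapply(idx_by_group, function(idx) {
    # Sample positions within the group; replace only if the group is too small
    pos <- sample.int(length(idx), size = n, replace = length(idx) < n)
    idx[pos]
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Toy data (made up for illustration): 1000 rows, 8 levels of type1
set.seed(1)
toy <- data.frame(id = 1:1000,
                  type1 = rep(LETTERS[1:8], length.out = 1000))

smp <- sample_rows_per_group(toy, "type1", 10)
nrow(smp)          # total number of sampled rows
table(smp$type1)   # rows drawn per type1 level
```

With the real data, `sample_rows_per_group(df, "type1", 10)` would be the corresponding call, assuming `df` holds the columns shown in the question.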

Answer 2

zx8*_*754 7

A base R solution:

do.call(rbind,
        lapply(split(df, df$type1), function(i)
          i[sample(1:nrow(i), size = 10, replace = TRUE),]))

Edit:

Other solutions suggested by @BrodieG:

with(df, df[unlist(lapply(split(seq_along(type1), type1), sample, 10, TRUE)), ])

with(df, df[c(sapply(split(seq_along(type1), type1), sample, 10, TRUE)), ])

Answer 3

Cat*_*ath 5

Another base R option:

df[as.vector(sapply(unique(df$type1), 
                    function(x){
                        sample(which(df$type1==x), 10, replace=TRUE)
                    })), ]
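
One caveat applies to variants that pass row indices straight to `sample()`: when its first argument is a single number x >= 1, base R's `sample()` draws from 1:x. A type1 level with exactly one row (such as H, row 6, in the excerpt from the question) would therefore be sampled from the wrong rows. A small sketch of the issue and one possible workaround using `sample.int()`; the names `one_row_group`, `naive` and `safe` are made up for illustration:

```r
one_row_group <- 6L   # e.g. the only row with type1 == "H" in the excerpt

set.seed(1)
# sample() sees a single number and draws from 1:6 instead of from c(6)
naive <- sample(one_row_group, 10, replace = TRUE)

# Sampling positions first and then indexing always stays within the group
safe <- one_row_group[sample.int(length(one_row_group), 10, replace = TRUE)]

range(naive)   # may span rows 1 to 6
unique(safe)   # only row 6
```

Indexing by sampled positions, as in `x[sample.int(length(x), ...)]`, behaves the same for groups of any size, so it could replace the direct `sample(which(...))` call if one-row groups may occur.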
