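How do I create a stratified sample in R using the "sampling" package? My dataset has 355,000 observations. The code works fine up to the last line. Below is the code I wrote, but I always get the following message: "Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?"
Please don't point me to older messages on Stackoverflow. I looked through them but haven't been able to make use of them. Thank you.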
## lpdata file has 355,000 observations
# Exclude Puerto Rico, Virgin Islands and Guam
sub.lpdata<-subset(lpdata,"STATE" != 'PR' | "STATE" != 'VI' | "STATE" != 'GU')
## Create a 10% sample, stratified by STATE
sort.lpdata<-sub.lpdata[order(sub.lpdata$STATE),]
tab.state<-data.frame(table(sort.lpdata$STATE))
size.strata<-as.vector(round(ceiling(tab.state$Freq)*0.1))
s<-strata(sort.lpdata,stratanames=sort.lpdata$STATE,size=size.strata,method="srswor")}
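I had to do something similar last year. If this is something you do a lot, you might want to use a function like the one shown below. This version takes the data frame to sample from, the grouping variable that defines the strata (as a column number), and the sample size; call set.seed() beforehand if you want reproducible results. You can save the function in a file such as "stratified.R" and load it whenever you need it. See http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/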
stratified = function(df, group, size) {
# USE: * Specify your data frame and grouping variable (as column
# number) as the first two arguments.
# * Decide on your sample size. For a sample proportional to the
# population, enter "size" as a decimal. For an equal number
# of samples from each group, enter "size" as a whole number.
#
# Example 1: Sample 10% of each group from a data frame named "z",
# where the grouping variable is the fourth variable, use:
#
# > stratified(z, 4, .1)
#
# Example 2: Sample 5 observations from each group from a data frame
# named "z"; grouping variable is the third variable:
#
# > stratified(z, 3, 5)
#
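  library(sampling)  # provides strata() and getdata()
  # Sort by the grouping column; [[ ]] extracts it as a plain vector
  temp <- df[order(df[[group]]), ]
  if (size < 1) {
    # Proportional sampling: round each group's share up
    size <- as.vector(ceiling(table(temp[[group]]) * size))
  } else {
    # Fixed number of rows per group
    size <- rep(size, times = length(table(temp[[group]])))
  }
  strat <- strata(temp, stratanames = names(temp[group]),
                  size = size, method = "srswor")
  getdata(temp, strat)  # return the sampled rows
}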
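For the code in the question specifically, a likely culprit is the last line: `stratanames` is meant to receive the name of the stratification column, not the column itself. The `subset()` call also compares string literals rather than the `STATE` column, so it filters nothing. Here is a minimal sketch of a corrected version, assuming a small made-up stand-in for `lpdata` (the `ID` column and the state codes are purely illustrative):

```r
library(sampling)  # provides strata() and getdata()

# Hypothetical stand-in for lpdata; the real data has 355,000 rows.
set.seed(1)
lpdata <- data.frame(
  ID    = 1:1000,
  STATE = sample(c("CA", "NY", "TX", "PR", "VI", "GU"), 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Exclude the territories: refer to the column unquoted and use %in%.
sub.lpdata <- subset(lpdata, !(STATE %in% c("PR", "VI", "GU")))

# Sort by STATE so the size vector lines up with the order of the strata.
sort.lpdata <- sub.lpdata[order(sub.lpdata$STATE), ]
size.strata <- as.vector(ceiling(table(sort.lpdata$STATE) * 0.1))

# Pass the column NAME, not the column itself.
s <- strata(sort.lpdata, stratanames = "STATE",
            size = size.strata, method = "srswor")
sample.lpdata <- getdata(sort.lpdata, s)
```

Using `ceiling()` alone (rather than `round(ceiling(...))`) guarantees at least one row from every non-empty stratum.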
Without knowing the strata function, a bit of hand coding may do what you want:
d <- expand.grid(id = 1:35000, stratum = letters[1:10])
p = 0.1
dsample <- data.frame()
system.time(
  for(i in levels(d$stratum)) {
    dsub <- subset(d, d$stratum == i)
    B = ceiling(nrow(dsub) * p)
    dsub <- dsub[sample(1:nrow(dsub), B), ]
    dsample <- rbind(dsample, dsub)
  }
)
# size per stratum in resulting df is 10 % of original size:
table(dsample$stratum)
HTH, Kai
PS: CPU time on my old laptop is 0.09!
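The same idea can also be written without growing a data frame inside a loop. The following is only a sketch in base R on the same toy data: it splits the row indices by stratum, samples within each group, and subsets once at the end. The seed value and the names `rows_by_stratum`, `picked`, and `dsample2` are arbitrary choices for illustration.

```r
# Same toy data as above: 10 strata of 35,000 rows each.
d <- expand.grid(id = 1:35000, stratum = letters[1:10])
p <- 0.1

set.seed(42)  # arbitrary seed, only to make the draw reproducible

# Split the row indices by stratum, then sample within each stratum.
rows_by_stratum <- split(seq_len(nrow(d)), d$stratum)
picked <- unlist(lapply(rows_by_stratum, function(rows) {
  n <- ceiling(length(rows) * p)
  # Index via sample.int() so a stratum with a single row is handled safely.
  rows[sample.int(length(rows), n)]
}))

# Subset the original data frame once, using all sampled row indices.
dsample2 <- d[picked, ]

# Rows drawn per stratum
table(dsample2$stratum)
```

Sampling indices and subsetting once avoids repeated `rbind()` calls, which can get slow when there are many strata.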