ema*_*ino 13 parallel-processing r multiple-mice r-mice
我希望通过使用micein 来运行150次多次插补R.然而,为了节省一些计算时间,我将谎言将并行流中的流程细分(如Stef van Buuren在"针对缺失数据的灵活插补"中所建议的那样).
我的问题是:怎么做?
我可以想象2个选项:
opt.1:
imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
Run Code Online (Sandbox Code Playgroud)
然后通过使用complete和as.mids之后将插补结合在一起
opt.2:
imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
Run Code Online (Sandbox Code Playgroud)
通过添加VAL_1to150否则,在我看来(我可能是错的),如果它们都使用相同的数据集和相同的种子运行,那么您将获得相同结果的150倍.
还有其他选择吗?
谢谢
Max*_*don 27
所以主要的问题是结合插补,正如我所看到的,有三种选择,使用ibind,complete如描述或试图保持中间结构.我强烈建议ibind解决方案.其他人留给那些好奇的答案.
在做任何事情之前,我们需要获得并行鼠标插补.并行部分相当简单,我们需要做的就是使用并行包并确保我们使用以下方法设置种子clusterSetRNGStream:
library(parallel)
# Using all cores can slow down the computer
# significantly, I therefore try to leave one
# core alone in order to be able to do something
# else during the time the code runs
cores_2_use <- detectCores() - 1
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
clusterExport(cl, "nhanes")
clusterEvalQ(cl, library(mice))
imp_pars <-
parLapply(cl = cl, X = 1:cores_2_use, fun = function(no){
mice(nhanes, m = 30, printFlag = FALSE)
})
stopCluster(cl)
Run Code Online (Sandbox Code Playgroud)
以上将产生cores_2_use * 30推算数据集.
ibind正如@AleksanderBlekh建议的那样,这mice::ibind可能是最好,最直接的解决方案:
imp_merged <- imp_pars[[1]]
for (n in 2:length(imp_pars)){
imp_merged <-
ibind(imp_merged,
imp_pars[[n]])
}
Run Code Online (Sandbox Code Playgroud)
foreach与ibind也许最简单的替代方案是使用foreach:
library(foreach)
library(doParallel)
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)
library(mice)
imp_merged <-
foreach(no = 1:cores_2_use,
.combine = ibind,
.export = "nhanes",
.packages = "mice") %dopar%
{
mice(nhanes, m = 30, printFlag = FALSE)
}
stopCluster(cl)
Run Code Online (Sandbox Code Playgroud)
complete使用complete(..., action="long"),rbind-ing这些然后使用as.mids其他mice对象提取完整数据集可能效果很好,但它会生成比其他两个方法更细的对象:
merged_df <- nhanes
merged_df <-
cbind(data.frame(.imp = 0,
.id = 1:nrow(nhanes)),
merged_df)
for (n in 1:length(imp_pars)){
tmp <- complete(imp_pars[[n]], action = "long")
tmp$.imp <- as.numeric(tmp$.imp) + max(merged_df$.imp)
merged_df <-
rbind(merged_df,
tmp)
}
imp_merged <-
as.mids(merged_df)
# Compare the most important the est and se for easier comparison
cbind(summary(pool(with(data=imp_merged,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=mice(nhanes,
m = 60,
printFlag = FALSE),
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])
Run Code Online (Sandbox Code Playgroud)
给出输出:
est se est se
(Intercept) 20.41921496 3.85943925 20.33952967 3.79002725
age -3.56928102 1.35801557 -3.65568620 1.27603817
hyp 1.63952970 2.05618895 1.60216683 2.17650536
chl 0.05396451 0.02278867 0.05525561 0.02087995
Run Code Online (Sandbox Code Playgroud)
我在下面的替代方法显示了如何合并插补对象并保留mids对象背后的完整功能.从ibind解决方案开始,我就把它留给了有兴趣探索如何合并复杂列表的人.
我已经查看过micemids-object了,为了在并行运行后得到至少一个类似的mids-object,你需要采取一些步骤.如果我们检查mids-object并将两个对象与两个不同的设置进行比较,我们得到:
library(mice)
imp <- list()
imp <- c(imp,
list(mice(nhanes, m = 40)))
imp <- c(imp,
list(mice(nhanes, m = 20)))
sapply(names(imp[[1]]),
function(n)
try(all(useful::compare.list(imp[[1]][[n]],
imp[[2]][[n]]))))
Run Code Online (Sandbox Code Playgroud)
您可以看到两次运行之间的调用,m,imp,chainMean和chainVar不同.除此之外,imp无疑是最重要的,但似乎也是更新其他组件的明智选择.因此,我们将首先构建一个鼠标合并功能:
mergeMice <- function (imp) {
merged_imp <- NULL
for (n in 1:length(imp)){
if (is.null(merged_imp)){
merged_imp <- imp[[n]]
}else{
counter <- merged_imp$m
# Update counter
merged_imp$m <-
merged_imp$m + imp[[n]]$m
# Rename chains
dimnames(imp[[n]]$chainMean)[[3]] <-
sprintf("Chain %d", (counter + 1):merged_imp$m)
dimnames(imp[[n]]$chainVar)[[3]] <-
sprintf("Chain %d", (counter + 1):merged_imp$m)
# Merge chains
merged_imp$chainMean <-
abind::abind(merged_imp$chainMean,
imp[[n]]$chainMean)
merged_imp$chainVar <-
abind::abind(merged_imp$chainVar,
imp[[n]]$chainVar)
for (nn in names(merged_imp$imp)){
# Non-imputed variables are not in the
# data.frame format but are null
if (!is.null(imp[[n]]$imp[[nn]])){
colnames(imp[[n]]$imp[[nn]]) <-
(counter + 1):merged_imp$m
merged_imp$imp[[nn]] <-
cbind(merged_imp$imp[[nn]],
imp[[n]]$imp[[nn]])
}
}
}
}
# TODO: The function should update the $call parameter
return(merged_imp)
}
Run Code Online (Sandbox Code Playgroud)
我们现在可以通过以下方式简单地合并上面生成的两个插补:
merged_imp <- mergeMice(imp)
merged_imp_pars <- mergeMice(imp_pars)
Run Code Online (Sandbox Code Playgroud)
现在看来我们得到了正确的输出:
# Compare the three alternatives
cbind(
summary(pool(with(data=merged_imp,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=merged_imp_pars,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=mice(nhanes,
m = merged_imp$m,
printFlag = FALSE),
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])
Run Code Online (Sandbox Code Playgroud)
得到:
est se est se
(Intercept) 20.16057550 3.74819873 20.31814393 3.7346252
age -3.67906629 1.19873118 -3.64395716 1.1476377
hyp 1.72637216 2.01171565 1.71063127 1.9936347
chl 0.05590999 0.02350609 0.05476829 0.0213819
est se
(Intercept) 20.14271905 3.60702992
age -3.78345532 1.21550474
hyp 1.77361005 2.11415290
chl 0.05648672 0.02046868
Run Code Online (Sandbox Code Playgroud)
好的,就是这样.玩得开心.