mjd*_*dub 5 for-loop r vectorization
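I'm trying to clean up this code and was wondering whether anyone has suggestions for running it in R without a loop. I have a dataset called data with 100 variables and 200,000 observations. What I want to do is essentially expand the dataset by multiplying each observation by a specific scalar and then combining the results. In the end I need a dataset with 800,000 observations (I have four categories to create) and 101 variables. Here is a loop I wrote that does this, but it is very inefficient and I'd like something faster.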
datanew <- c()
for (i in 1:51){
  for (k in 1:6){
    for (m in 1:4){
      sub <- subset(data,data$var1==i & data$var2==k)
      sub[,4:(ncol(sub)-1)] <- filingstat0711[i,k,m]*sub[,4:(ncol(sub)-1)]
      sub$newvar <- m
      datanew <- rbind(datanew,sub)
    }
  }
}
Please let me know what you think, and thanks for your help.
Below is some sample data with 2K observations instead of 200K.
# SAMPLE DATA
#------------------------------------------------#
# 2,000 rows by 100 columns; columns 1-2 are dropped below and replaced by var1 and var2
mydf <- as.data.frame(matrix(rnorm(100 * 20e2), ncol=100, nrow=20e2))
var1 <- c(sapply(seq(41), function(x) sample(1:51)))[1:20e2]
var2 <- c(sapply(seq(2 + 20e2/6), function(x) sample(1:6)))[1:20e2]
#----------------------------------#
mydf <- cbind(var1, var2, round(mydf[3:100]*2.5, 2))
filingstat0711 <- array(round(rnorm(51*6*4)*1.5 + abs(rnorm(2)*10)), dim=c(51,6,4))
#------------------------------------------------#
You can try the following. Note that the first two for loops are replaced by a call to mapply, and the third for loop by a call to lapply. In addition, we create two vectors and combine them so that the multiplication is vectorized.
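# create a table of the i-k index combinations using `expand.grid`
ixk <- expand.grid(i=1:51, k=1:6)
# Take a look at what expand.grid does
head(ixk, 60)
# create two vectors for multiplying against our dataframe subset:
# multpVec is 1 for the columns to scale (4 through ncol-1, as in the loop
# in the question) and 0 elsewhere; invVec is its logical complement
multpVec <- c(rep(c(0, 1), times=c(3, ncol(mydf)-3-1)), 0)
invVec <- !multpVec
# example of how we will use the vectors: the result is the scalar
# for the scaled columns and 1 for the columns left unchanged
(multpVec * filingstat0711[1, 2, 1] + invVec)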
# Instead of for loops, we can use mapply.
newdf <-
  mapply(function(i, k)
      # The function being `mapply`ed rbinds a list of tables, each one
      # the subset matching var1 == i & var2 == k, multiplied by a value
      # in filingstat0711
      do.call(rbind,
        # iterating over m
        lapply(1:4, function(m)
          # the cbind adds newvar = m at the end of the subtable
          cbind(
            # we transpose twice: first the subset, so it lines up with our vector;
            # then the result, to get back the original form
            t( t(subset(mydf, var1==i & var2==k)) *
               (multpVec * filingstat0711[i,k,m] + invVec)),
            # this is an argument to cbind
            "newvar"=m)
        )),
    # the two vectors passed as arguments are the columns of the expanded grid
    ixk$i, ixk$k, SIMPLIFY=FALSE
  )
# combine the list of per-(i, k) results into a single table
newdf <- do.call(rbind, newdf)
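To see why the double transpose works, here is a minimal sketch on a toy matrix. The names toy, multp, inv, scalar and colFactor, the values, and the choice of which columns to scale are all made up for illustration; scalar stands in for a single entry of filingstat0711.

```r
# Toy stand-in for one subset: 3 rows, 6 columns (hypothetical values).
toy <- matrix(1:18, nrow = 3,
              dimnames = list(NULL, c("var1", "var2", "id", "x1", "x2", "last")))

# 1 marks columns to scale, 0 marks columns to leave alone.
# Scaling columns 4 and 5 is an assumption made for this demo.
multp  <- c(0, 0, 0, 1, 1, 0)
inv    <- !multp   # TRUE where multp is 0
scalar <- 10       # stands in for filingstat0711[i, k, m]

# Per-column factor: `scalar` for the scaled columns, 1 for the rest.
colFactor <- multp * scalar + inv

# t(toy) has one row per original column, so the factor vector is recycled
# down each column of t(toy), which means across the columns of toy.
scaled <- t(t(toy) * colFactor)

# sweep() expresses the same column-wise multiplication directly.
scaled2 <- sweep(toy, 2, colFactor, FUN = "*")

print(scaled)
print(isTRUE(all.equal(scaled, scaled2)))
```

If the transposes feel opaque, sweep() is probably the more readable way to write the same step.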
Try not to use names such as data, table, df, or sub, which are already commonly used functions. In the code above I used mydf in place of data.
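As a quick check of that naming advice, the sketch below asks R whether each name already resolves to a function. It assumes a default session with the usual packages (base, stats, utils) attached; candidates and alreadyFunction are illustrative names.

```r
# Names mentioned in the advice above; each is looked up as a function.
candidates <- c("data", "table", "df", "sub")

# exists(..., mode = "function") is TRUE when a function of that name
# is visible on the search path.
alreadyFunction <- sapply(candidates, exists, mode = "function")
print(alreadyFunction)
```

Assigning to one of these names does not break the function lookup, but it makes code harder to read and error messages harder to interpret.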
You could use apply(ixk, 1, fu..) instead of the mapply I used, but I think mapply makes the code cleaner in this case.
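For comparison, here is a rough sketch of both forms on a tiny made-up dataset. toydf, toystat, grid and the helper scaleOne are hypothetical stand-ins (not the objects from the question), and only a single column x is scaled to keep the example short.

```r
# Small hypothetical stand-ins for mydf and filingstat0711.
toydf <- data.frame(var1 = rep(1:2, each = 3),
                    var2 = rep(1:3, times = 2),
                    x    = 1:6)
toystat <- array(1:12, dim = c(2, 3, 2))   # indexed as [i, k, m]

grid <- expand.grid(i = 1:2, k = 1:3)

# Hypothetical helper: scale column x of the (i, k) subset once per m.
scaleOne <- function(i, k) {
  sub <- toydf[toydf$var1 == i & toydf$var2 == k, ]
  do.call(rbind, lapply(1:2, function(m) {
    out <- sub
    out$x <- out$x * toystat[i, k, m]
    out$newvar <- m
    out
  }))
}

# mapply receives the two grid columns as separate arguments.
viaMapply <- do.call(rbind,
                     mapply(scaleOne, grid$i, grid$k, SIMPLIFY = FALSE))

# apply receives each grid row as one named vector, so it must be unpacked.
viaApply <- do.call(rbind,
                    apply(grid, 1, function(row) scaleOne(row[["i"]], row[["k"]])))

print(isTRUE(all.equal(viaMapply, viaApply)))
```

The apply version needs the extra unpacking step inside the anonymous function, which is one reason mapply can read more cleanly here.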