R - 需要帮助加速for循环

Jus*_*tin 5 for-loop r vectorization dataframe data.table

我有两个数据帧; 一个是48行长,看起来像这样:

name = Z31

     Est.Date   Site    Cultivar   Planting
1   24/07/2011  Birchip Axe           1
2   08/08/2011  Birchip Bolac         1
3   24/07/2011  Birchip Derrimut      1
4   12/08/2011  Birchip Eaglehawk     1
5   29/07/2011  Birchip Gregory       1
6   29/07/2011  Birchip Lincoln       1
7   23/07/2011  Birchip Mace          1
8   29/07/2011  Birchip Scout         1
9   17/09/2011  Birchip Axe           2
10  19/09/2011  Birchip Bolac         2
Run Code Online (Sandbox Code Playgroud)

另一个是> 23000行,包含模拟器的输出.它看起来像这样:

name = pred

    Date        maxt    mint    Cultivar    Site    Planting    tt  cum_tt
1   5/05/2011   18       6.5    Axe        Birchip  1        12.25  0
2   6/05/2011   17.5     2.5    Axe        Birchip  1        10     0
3   7/05/2011   18       2.5    Axe        Birchip  1        10.25  0
4   8/05/2011   19.5       2    Axe        Birchip  1        10.75  0
5   9/05/2011   17       4.5    Axe        Birchip  1        10.75  0
6   10/05/2011  15.5    -0.5    Axe        Birchip  1        7.5    0
7   11/05/2011  14       5.5    Axe        Birchip  1        9.75   0
8   12/05/2011  19         8    Axe        Birchip  1        13.5   0
9   13/05/2011  18.5     7.5    Axe        Birchip  1        13     0
10  14/05/2011  16       3.5    Axe        Birchip  1        9.75   0
Run Code Online (Sandbox Code Playgroud)

我想要做的是让cum_tt列开始将当前行的tt列添加到前一行的cum_tt(累积加法),只要前导DF中的日期等于或大于Z31 est.Date .我已经写了以下for循环:

for (i in 1:nrow(Z31)){
  for (j in 1:nrow(pred)){
    if (Z31[i,]$Site == pred[j,]$Site & Z31[i,]$Cultivar == pred[j,]$Cultivar & Z31[i,]$Planting == pred[j,]$Planting &
        pred[j,]$Date >= Z31[i,]$Est.Date)
    {
      pred[j,]$cum_tt <- pred[j,]$tt + pred[j-1,]$cum_tt
    }
  }
}
Run Code Online (Sandbox Code Playgroud)

这有效,但速度太慢,需要大约一个小时才能运行整套.我知道循环不是R的强项,所以任何人都可以协助我进行矢量化操作吗?

提前致谢.

UPDATE

这是dput(Z31)的输出:

structure(list(Est.Date = structure(c(15179, 15194, 15179, 15198, 15184, 15184, 15178, 15184, 15234, 15236, 15230, 15238, 15229, 15236, 15229, 15231, 15155, 15170, 15160, 15168, 15165, 15159, 15170, 15170, 15191, 15205, 15198, 15203, 15202, 15195, 15203, 15206, 15193, 15193, 15195, 15200, 15193, 15205, 15200, 15205, 15226, 15245, 15231, 15259, 15241, 15241, 15241, 15241), class = "Date"), Site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Birchip", "Gatton", "Tarlee"), class = "factor"), Cultivar = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), .Label = c("Axe", "Bolac", "Derrimut", "Eaglehawk", "Gregory", "Lincoln", "Mace", "Scout"), class = "factor"), Planting = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L)), .Names = c("Est.Date", "Site", "Cultivar", "Planting"), row.names = c(NA, -48L), class = "data.frame")

这是预测.请注意,这里有一些额外的colums.为了便于阅读,我刚刚将上面的相关内容包括在内.

structure(list(Date = structure(c(15099, 15100, 15101, 15102, 
15103, 15104, 15105, 15106, 15107, 15108, 15109, 15110, 15111, 
15112, 15113, 15114, 15115, 15116, 15117, 15118), class = "Date"), 
    flowering_das = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Zadok = c(9, 9, 
    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 11, 11.032, 11.085, 
    11.157), stagename = structure(c(8L, 8L, 8L, 8L, 8L, 8L, 
    8L, 8L, 9L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 1L, 3L, 3L, 3L), .Label = c("emergence", 
    "end_grain_fill", "end_of_juvenil", "floral_initiat", "flowering", 
    "germination", "maturity", "out", "sowing", "start_grain_fi"
    ), class = "factor"), node_no = c(0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 2, 2.032, 2.085, 2.157), maxt = c(18, 
    17.5, 18, 19.5, 17, 15.5, 14, 19, 18.5, 16, 16, 15, 16.5, 
    16.5, 20.5, 23, 25.5, 16.5, 16.5, 15), mint = c(6.5, 2.5, 
    2.5, 2, 4.5, -0.5, 5.5, 8, 7.5, 3.5, 6, 1, 5.5, 2, 7, 7, 
    9, 13.5, 11.5, 8.5), Cultivar = c("Axe", "Axe", "Axe", "Axe", 
    "Axe", "Axe", "Axe", "Axe", "Axe", "Axe", "Axe", "Axe", "Axe", 
    "Axe", "Axe", "Axe", "Axe", "Axe", "Axe", "Axe"), Site = c("Birchip", 
    "Birchip", "Birchip", "Birchip", "Birchip", "Birchip", "Birchip", 
    "Birchip", "Birchip", "Birchip", "Birchip", "Birchip", "Birchip", 
    "Birchip", "Birchip", "Birchip", "Birchip", "Birchip", "Birchip", 
    "Birchip"), Planting = c("1", "1", "1", "1", "1", "1", "1", 
    "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
    "1"), `NA` = c("Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", 
    "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", 
    "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", 
    "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", 
    "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", 
    "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", 
    "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out"
    ), tt = c(12.25, 10, 10.25, 10.75, 10.75, 7.5, 9.75, 13.5, 
    13, 9.75, 11, 8, 11, 9.25, 13.75, 15, 17.25, 15, 14, 11.75
    ), cum_tt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0)), .Names = c("Date", "flowering_das", "Zadok", 
"stagename", "node_no", "maxt", "mint", "Cultivar", "Site", "Planting", 
NA, "tt", "cum_tt"), row.names = c(NA, 20L), class = "data.frame")
Run Code Online (Sandbox Code Playgroud)

UPDATE

谢谢大家的帮助.我仍然是传统做事的新手,我无法及时实施一些更复杂的解决方案.对于Subs建议的方式,我有一些时间.它足够快,现在可以做我需要的.对于Z的一次迭代,这些数字以秒为单位.

我的方式:59.77

潜艇:14.62

使用数字日期的子订单:11.12

Tim*_*m P 5

当然,我们可以在几秒钟内做到这一点......我的第一个答案就在这里温柔!

## first make sure we have dates in a suitable format for comparison
## by using strptime, creating the columns estdate_tidy and date_tidy
## in Z31 and pred respectively

Z31$estdate_tidy = strptime(as.character(Z31$Est.Date), "%d/%m/%Y")
pred$date_tidy = strptime(as.character(pred$Date), "%d/%m/%Y")

## now map the estdate_tidy column over to pred using the match command -
## Z31_m and pred_m are dummy variables that hopefully make this clear

Z31_m = paste(Z31$Site, Z31$Cultivar, Z31$Planting)
pred_m = paste(pred$Site, pred$Cultivar, pred$Planting)
pred$estdate_tidy = Z31$estdate_tidy[match(pred_m, Z31_m)]

## then define a ttfilter column that copies tt, except for being 0 when
## estdate_tidy is after date_tidy (think this is what you described)

pred$ttfilter = ifelse(pred$date_tidy >= pred$estdate_tidy, pred$tt, 0)

## finally use cumsum function to sum this new column up (looks like you
## wanted the answer in cum_tt so I've put it there)

pred$cum_tt = cumsum(pred$ttfilter)
Run Code Online (Sandbox Code Playgroud)

希望这可以帮助 :)

更新(6月7日):

处理新规范的矢量化代码 - 即每个条件(网站/品种/种植)应单独完成的cumsum - 如下所示:

Z31$Lookup=with(Z31,paste(Site,Cultivar,Planting,sep="~"))
Z31$LookupNum=match(Z31$Lookup,unique(Z31$Lookup))
pred$Lookup=with(pred,paste(Site,Cultivar,Planting,sep="~"))
pred$LookupNum=match(pred$Lookup,unique(pred$Lookup))

pred$Est.Date = Z31$Est.Date[match(pred$Lookup,Z31$Lookup)]
pred$ttValid = (pred$Date>=pred$Est.Date)
pred$ttFiltered = ifelse(pred$ttValid, pred$tt, 0)

### now fill in cumsum of ttFiltered separately for each LookupNum
pred$cum_tt_Z31 = as.vector(unlist(tapply(pred$ttFiltered,
                                          pred$LookupNum,cumsum)))
Run Code Online (Sandbox Code Playgroud)

我的机器上的运行时间是0.16秒,最后一pred$cum_tt_Z31列与非矢量化代码的答案完全匹配:)

为了完整起见,值得注意的是,上面的最后一个复杂的tapply行可以用以下更简单的方法替换,在48种可能的情况下使用短循环:

pred$cum_tt_Z31 = rep(NA, nrow(pred))
for (lookup in unique(pred$Lookup)) {
    subs = which(pred$Lookup==lookup)
    pred$cum_tt_Z31[subs] = cumsum(pred$ttFiltered[subs])
}
Run Code Online (Sandbox Code Playgroud)

运行时只会略微增加到0.25秒左右,因为这里的循环非常小,并且在循环内完成的工作是矢量化的.

认为我们破解了它!:)

关于矢量化的一些快速观察(6月8日):

矢量化过程步骤的过程使运行时间从接近一小时减少到总共0.16秒.即使允许不同的机器速度,这也是至少10,000倍的加速,这使得2-5的因素相形见绌,这可能来自于进行微小调整但仍保留循环结构.

要做的第一个关键观察:在解决方案中,每一行都创建 - 没有循环 - 一个与Z31或pred中的列长度相同的整个新向量.为了整洁,我经常发现将这些新向量创建为新的数据框列很有用,但显然这不是必需的.

第二个观察:使用"paste'n'match"策略将所需的Est.Date列从Z31正确地转移到pred.有这种任务的替代方法(例如使用合并),但我采用这条路线,因为它完全是故障保护并保证在pred中保持顺序(这是至关重要的).基本上,粘贴操作只允许您一次匹配多个字段,因为如果粘贴的字符串匹配,那么它们的所有组成部分都匹配.我使用〜作为分隔符(假设我知道符号不会出现在任何字段中),以避免粘贴操作导致任何歧义.如果你使用空格分隔符,那么粘贴像"AB","C","D"这样的东西会得到与粘贴相同的结果("A","BC","D") - 我们希望避免任何头痛!

第三个观察:很容易对逻辑运算进行矢量化,例如查看一个向量是否超过另一个向量(参见pred $ ttValid),或者根据向量的值选择一个值或另一个值(请参阅pred $ ttFiltered).在目前的情况下,这些可以合并为一行,但作为一个示范,我把事情打破了一点.

第四个观察:创建pred $ cum_tt_Z31的最后一行基本上只使用tapply对应于pred $ LookupNum的每个单独值的行进行cumsum操作,这允许你在不同的行组上应用相同的函数(在这里,我们'通过pred $ LookupNum进行分组).pred $ LookupNum的定义在这里有很大帮助 - 它是一个块为1的数字索引,后跟一个2s的块,依此类推.这意味着从tapply出来的最终cumsum向量列表可以简单地不列出并放入向量中并自动按正确的顺序排列.如果你按照这样的顺序进行tapply和拆分,你通常需要一些额外的代码行才能正确地重新匹配(尽管它并不棘手).

最后的观察:如果最终的tapply太可怕了,那么值得强调的是,如果循环中的工作很好地被矢量化,那么对少数情况(48说)的快速循环并不总是灾难性的.UPDATE部分末尾的"替代方法"表明,通过预先准备一个答案列(最初是所有NAs),然后逐个完成48个子集并填写,可以实现cumsum-on-groups步骤.每个块都有适当的cumsum.然而,如文中所述,这一步骤的速度大约是聪明的tapply方法的一半,如果需要更多的子集,这将会相当慢.

如果有人对此类任务有任何后续问题,请不要犹豫,给我一个喊.