How to "cascade" calculated values in R using dplyr

bri*_*enb 2 r dplyr

Suppose I have a data_frame that looks like this:

dput(df)
structure(list(Name = c("John Smith", "John Smith", "John Smith", 
"John Smith", "John Smith"), Account_Number = c("XXXX XXXX 0000", 
"XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000"
), Transaction_Date = c("04/01/16", "04/02/16", "04/03/16", "04/04/16", 
"04/05/16"), Amount = c(NA, 749, -256, 392, NA), Balance = c(2000, 
NA, NA, NA, 1500)), .Names = c("Name", "Account_Number", "Transaction_Date", 
"Amount", "Balance"), row.names = c(NA, 5L), class = c("tbl_df", 
"tbl", "data.frame"))

To make the problem easier to see, here it is printed out:

#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749      NA
#3 John Smith XXXX XXXX 0000         04/03/16   -256      NA
#4 John Smith XXXX XXXX 0000         04/04/16    392      NA
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

What I would like to do is fill in the NA values of Balance with the sum Balance[i-1] + Amount[i]. I thought this would be easy to do with dplyr, using the following:

library(lubridate)
library(dplyr)
df %>%
  arrange(mdy(Transaction_Date)) %>%
  mutate(Balance = ifelse(is.na(Balance), as.numeric(lag(Balance)) + as.numeric(Amount), Balance))

Unfortunately, this gives me the following:

#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749    2749
#3 John Smith XXXX XXXX 0000         04/03/16   -256      NA
#4 John Smith XXXX XXXX 0000         04/04/16    392      NA
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

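So it seems that all of the values are computed at once, whereas what I want is a row-by-row calculation in which each filled-in Balance feeds into the next row.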
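To see why only the second row was filled, it may help to evaluate the pieces of that `mutate()` call on bare vectors. This is a minimal sketch that copies the `Amount` and `Balance` columns from the example data; `filled` is a name introduced here for illustration.

```r
library(dplyr)

Balance <- c(2000, NA, NA, NA, 1500)
Amount  <- c(NA, 749, -256, 392, NA)

# lag() shifts the ORIGINAL vector by one position; it never sees
# values that ifelse() fills in later in the same expression
lag(Balance)
# → NA 2000 NA NA NA

# only position 2 has a non-NA predecessor, so only it gets a value
filled <- ifelse(is.na(Balance), lag(Balance) + Amount, Balance)
filled
# → 2000 2749 NA NA 1500
```

Because the whole column is processed in one vectorized step, rows 3 and 4 still look at the original `NA` in the row above them.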

The desired result is as follows:

#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749    2749
#3 John Smith XXXX XXXX 0000         04/03/16   -256    2493
#4 John Smith XXXX XXXX 0000         04/04/16    392    2885
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

I believe I could use apply, but I would prefer to keep this in a dplyr pipeline if possible. Thanks in advance for any tips.

Update:

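Based on this question, it looks like I could use RcppRoll::roll_sum, but that function appears to take only one variable, whereas I need to use two. So I would also accept an answer that demonstrates how to use that function.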

ali*_*ire 5

EDIT: Warning!

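The original approach presented here does not handle resets of Balance correctly, as you will see if you pass it df %>% bind_rows(df). I am only leaving it here because it is the accepted answer. See below for an updated approach that avoids this problem.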


Original [incorrect] approach

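What you really need is a cumulative sum, but cumsum is a pain to use here because it has no na.rm argument. You can, however, replace the NA values and put them back afterwards: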

# replace NAs with 0s so cumsum will work
df %>% mutate_each(funs(ifelse(is.na(.), 0, .)), Balance, Amount) %>% 
    # replace 0 values in Balance with cumsum of Balance and Amount
    mutate(Balance = ifelse(Balance == 0, cumsum(Balance + Amount), Balance)) %>% 
    # put NAs back
    mutate(Amount = ifelse(Amount == 0, NA, Amount))

# Source: local data frame [5 x 5]
# 
#         Name Account_Number Transaction_Date Amount Balance
#        (chr)          (chr)            (chr)  (dbl)   (dbl)
# 1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
# 2 John Smith XXXX XXXX 0000         04/02/16    749    2749
# 3 John Smith XXXX XXXX 0000         04/03/16   -256    2493
# 4 John Smith XXXX XXXX 0000         04/04/16    392    2885
# 5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

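Note that if you have actual 0 values in Balance or Amount (or if that is possible), you may need to make the approach more robust, because the code above uses 0 as a stand-in for NA.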
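One way to avoid the 0 sentinel is to test for missingness with `is.na()` and substitute 0 only inside the running total. The following is a rough sketch, not part of the original answer: `df0`, `running`, and `out` are hypothetical names, and the data is made up so that one transaction is a genuine 0. It shares the reset problem described in the warning above and only addresses the handling of zeros.

```r
library(dplyr)

# Hypothetical data: the third Amount is a genuine 0, not a missing value
df0 <- tibble(
  Amount  = c(NA, 749, 0, 392, NA),
  Balance = c(2000, NA, NA, NA, 1500)
)

out <- df0 %>%
  mutate(
    # coalesce() substitutes 0 only inside the running total;
    # the Amount column itself is never overwritten
    running = cumsum(coalesce(Balance, 0) + coalesce(Amount, 0)),
    # decide by is.na(), not by comparing against a 0 sentinel
    Balance = ifelse(is.na(Balance), running, Balance)
  ) %>%
  select(-running)

out$Balance
# → 2000 2749 2749 3141 1500
```

Since `Amount` is never modified, the genuine 0 stays a 0 and the two `NA` amounts stay `NA`, so no "put NAs back" step is needed.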


New [working] approach

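By grouping on the run lengths of whether Amount is NA, we can make sure that we add the correct cumulative sum, instead of adding Amount to the pre-reset value of Balance: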

# pass it a bigger df to test
df %>% bind_rows(df) %>% 
    # replace NAs with last value
    tidyr::fill(Balance) %>% 
    # group so cumsums are not added after Balance reset
    group_by(NA_Amount = is.na(Amount), 
             rle_Amount = data.table::rleid(NA_Amount)) %>%
    mutate(Balance = ifelse(NA_Amount, Balance, Balance + cumsum(Amount))) %>%
    # clean up columns
    ungroup() %>% select(-NA_Amount, -rle_Amount)

# Source: local data frame [10 x 5]
# 
#          Name Account_Number Transaction_Date Amount Balance
#         (chr)          (chr)            (chr)  (dbl)   (dbl)
# 1  John Smith XXXX XXXX 0000         04/01/16     NA    2000
# 2  John Smith XXXX XXXX 0000         04/02/16    749    2749
# 3  John Smith XXXX XXXX 0000         04/03/16   -256    2493
# 4  John Smith XXXX XXXX 0000         04/04/16    392    2885
# 5  John Smith XXXX XXXX 0000         04/05/16     NA    1500
# 6  John Smith XXXX XXXX 0000         04/01/16     NA    2000
# 7  John Smith XXXX XXXX 0000         04/02/16    749    2749
# 8  John Smith XXXX XXXX 0000         04/03/16   -256    2493
# 9  John Smith XXXX XXXX 0000         04/04/16    392    2885
# 10 John Smith XXXX XXXX 0000         04/05/16     NA    1500
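If you would rather not depend on tidyr and data.table, the same idea can probably be expressed with dplyr alone by starting a new group at every known Balance. This is a sketch under that assumption, not the answer's original code: `reset_id`, `step`, and `result` are names introduced here, and the example data is trimmed to the three columns that matter.

```r
library(dplyr)

df <- tibble(
  Transaction_Date = c("04/01/16", "04/02/16", "04/03/16", "04/04/16", "04/05/16"),
  Amount  = c(NA, 749, -256, 392, NA),
  Balance = c(2000, NA, NA, NA, 1500)
)

result <- df %>%
  bind_rows(df) %>%
  # every non-NA Balance is a reset point and opens a new group
  mutate(reset_id = cumsum(!is.na(Balance))) %>%
  group_by(reset_id) %>%
  mutate(
    # rows with a known Balance contribute nothing to the running sum
    step = ifelse(is.na(Balance), coalesce(Amount, 0), 0),
    # known starting Balance of the group plus the amounts seen since
    Balance = first(Balance) + cumsum(step)
  ) %>%
  ungroup() %>%
  select(-reset_id, -step)

result$Balance
# → 2000 2749 2493 2885 1500 2000 2749 2493 2885 1500
```

Rows that come before the first known Balance fall into group 0, which has no starting value, so their Balance stays `NA`.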