Suppose I have a data_frame that looks like this:
dput(df)
structure(list(Name = c("John Smith", "John Smith", "John Smith",
"John Smith", "John Smith"), Account_Number = c("XXXX XXXX 0000",
"XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000"
), Transaction_Date = c("04/01/16", "04/02/16", "04/03/16", "04/04/16",
"04/05/16"), Amount = c(NA, 749, -256, 392, NA), Balance = c(2000,
NA, NA, NA, 1500)), .Names = c("Name", "Account_Number", "Transaction_Date",
"Amount", "Balance"), row.names = c(NA, 5L), class = c("tbl_df",
"tbl", "data.frame"))
To make the problem easier to see, here it is printed:
# Name Account_Number Transaction_Date Amount Balance
# (chr) (chr) (chr) (dbl) (dbl)
#1 John Smith XXXX XXXX 0000 04/01/16 NA 2000
#2 John Smith XXXX XXXX 0000 04/02/16 749 NA
#3 John Smith XXXX XXXX 0000 04/03/16 -256 NA
#4 John Smith XXXX XXXX 0000 04/04/16 392 NA
#5 John Smith XXXX XXXX 0000 04/05/16 NA 1500
What I want to do is fill in the NA values of Balance with the sum Balance[i-1] + Amount[i]. I thought this would be easy to do with dplyr, using the following:
library(lubridate)
library(dplyr)
df %>%
arrange(mdy(Transaction_Date)) %>%
mutate(Balance = ifelse(is.na(Balance), as.numeric(lag(Balance)) + as.numeric(Amount), Balance))
Unfortunately, this gives me the following:
# Name Account_Number Transaction_Date Amount Balance
# (chr) (chr) (chr) (dbl) (dbl)
#1 John Smith XXXX XXXX 0000 04/01/16 NA 2000
#2 John Smith XXXX XXXX 0000 04/02/16 749 2749
#3 John Smith XXXX XXXX 0000 04/03/16 -256 NA
#4 John Smith XXXX XXXX 0000 04/04/16 392 NA
#5 John Smith XXXX XXXX 0000 04/05/16 NA 1500
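So it seems that all the values are computed at once from the original Balance column, whereas what I want is a row-by-row calculation in which each filled-in value feeds into the next row.
The desired result is: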
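A quick way to see why rows 3 and 4 stay NA is to reproduce the computation outside the pipeline. This is only a base R sketch: the vector names `balance`, `amount` and `lagged` are made up for illustration, and `lagged` is built by hand to mimic what `lag(Balance)` returns for this column.

```r
# Columns copied from the example data frame
balance <- c(2000, NA, NA, NA, 1500)
amount  <- c(NA, 749, -256, 392, NA)

# Hand-built equivalent of lag(balance): the ORIGINAL column shifted down by one.
# It never sees the values that ifelse() fills in later.
lagged <- c(NA, head(balance, -1))

# Same expression as in the mutate() call, evaluated for all rows at once
result <- ifelse(is.na(balance), lagged + amount, balance)
print(lagged)
print(result)
```

Only row 2 has a non-NA previous balance in the original column, so it is the only missing value that gets filled.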
# Name Account_Number Transaction_Date Amount Balance
# (chr) (chr) (chr) (dbl) (dbl)
#1 John Smith XXXX XXXX 0000 04/01/16 NA 2000
#2 John Smith XXXX XXXX 0000 04/02/16 749 2749
#3 John Smith XXXX XXXX 0000 04/03/16 -256 2493
#4 John Smith XXXX XXXX 0000 04/04/16 392 2885
#5 John Smith XXXX XXXX 0000 04/05/16 NA 1500
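I'm sure I could use apply, but I'd prefer to keep this in a dplyr pipeline if possible. Thanks in advance for any tips.
Based on this question, it looks like I could use RcppRoll::roll_sum, but that function appears to take only one variable, whereas I need to use two. So I would also accept an answer that demonstrates how to use that function.
The original approach presented here does not handle Balance resets correctly, as you will see if you pass it df %>% bind_rows(df). I'm leaving it here only because it is the accepted answer. See below for an updated approach that avoids this problem.
What you really need is a cumulative sum, but cumsum is a pain to use here because it has no na.rm argument. You can, however, remove and re-insert the NA values: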
# replace NAs with 0s so cumsum will work
# (across() needs dplyr >= 1.0.0; it replaces the deprecated mutate_each()/funs())
df %>% mutate(across(c(Balance, Amount), ~ ifelse(is.na(.x), 0, .x))) %>%
# replace 0 values in Balance with cumsum of Balance and Amount
  mutate(Balance = ifelse(Balance == 0, cumsum(Balance + Amount), Balance)) %>%
# put NAs back
  mutate(Amount = ifelse(Amount == 0, NA, Amount))
# Source: local data frame [5 x 5]
#
# Name Account_Number Transaction_Date Amount Balance
# (chr) (chr) (chr) (dbl) (dbl)
# 1 John Smith XXXX XXXX 0000 04/01/16 NA 2000
# 2 John Smith XXXX XXXX 0000 04/02/16 749 2749
# 3 John Smith XXXX XXXX 0000 04/03/16 -256 2493
# 4 John Smith XXXX XXXX 0000 04/04/16 392 2885
# 5 John Smith XXXX XXXX 0000 04/05/16 NA 1500
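Note that if you have actual 0 values in Balance or Amount (or if that is possible), you may need to make the approach more robust, because 0 is used here as the stand-in for NA.
By grouping on the run lengths of whether Amount is NA, we can make sure we add the right cumulative sums, instead of adding Amount values to the pre-reset value of Balance: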
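One way to avoid using 0 as a marker is to remember which entries were NA before zero-filling. The following is a minimal base R sketch of that idea, not part of the original answer; the helper name `fill_balance_na_mask` is hypothetical. It keeps the same cumulative-sum logic as above, so it still does not handle Balance resets correctly.

```r
# Hypothetical helper: same cumsum idea, but missing balances are tracked
# with an explicit logical mask instead of being recognised by the value 0.
fill_balance_na_mask <- function(balance, amount) {
  missing_balance <- is.na(balance)          # remember where Balance was NA
  bal0 <- ifelse(missing_balance, 0, balance)
  amt0 <- ifelse(is.na(amount), 0, amount)
  running <- cumsum(bal0 + amt0)             # running total over both columns
  # only originally missing balances are overwritten; a genuine 0 is kept
  ifelse(missing_balance, running, balance)
}

filled <- fill_balance_na_mask(c(2000, NA, NA, NA, 1500),
                               c(NA, 749, -256, 392, NA))
print(filled)

# A genuine zero balance is left alone
with_zero <- fill_balance_na_mask(c(100, 0), c(NA, NA))
print(with_zero)
```

Since the helper takes and returns plain vectors, it could presumably be called inside `mutate()`, for example `mutate(Balance = fill_balance_na_mask(Balance, Amount))`, which also leaves Amount untouched so no NAs need to be put back.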
# pass it a bigger df to test
df %>% bind_rows(df) %>%
# replace NAs with last value
tidyr::fill(Balance) %>%
# group so cumsums are not added after Balance reset
group_by(NA_Amount = is.na(Amount),
rle_Amount = data.table::rleid(NA_Amount)) %>%
mutate(Balance = ifelse(NA_Amount, Balance, Balance + cumsum(Amount))) %>%
# clean up columns
ungroup() %>% select(-NA_Amount, -rle_Amount)
# Source: local data frame [10 x 5]
#
# Name Account_Number Transaction_Date Amount Balance
# (chr) (chr) (chr) (dbl) (dbl)
# 1 John Smith XXXX XXXX 0000 04/01/16 NA 2000
# 2 John Smith XXXX XXXX 0000 04/02/16 749 2749
# 3 John Smith XXXX XXXX 0000 04/03/16 -256 2493
# 4 John Smith XXXX XXXX 0000 04/04/16 392 2885
# 5 John Smith XXXX XXXX 0000 04/05/16 NA 1500
# 6 John Smith XXXX XXXX 0000 04/01/16 NA 2000
# 7 John Smith XXXX XXXX 0000 04/02/16 749 2749
# 8 John Smith XXXX XXXX 0000 04/03/16 -256 2493
# 9 John Smith XXXX XXXX 0000 04/04/16 392 2885
# 10 John Smith XXXX XXXX 0000 04/05/16 NA 1500
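To make the grouping step more concrete, here is a small base R sketch of the run ids that the grouping relies on. It uses `rle()` from base R in place of `data.table::rleid()`; the names `is_na_amount`, `runs` and `run_id` are illustrative only.

```r
# Amount column of the doubled data frame (df %>% bind_rows(df))
amount <- c(NA, 749, -256, 392, NA, NA, 749, -256, 392, NA)
is_na_amount <- is.na(amount)

# Run-length encode the NA flag, then expand each run into a run id
runs   <- rle(is_na_amount)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
print(run_id)
```

Rows 2 to 4 and rows 7 to 9 end up in separate runs, so `cumsum(Amount)` restarts after each block of NA rows instead of carrying over across a Balance reset.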