How to compute a rolling sum over consecutive rows, by group, in R

pss*_*guy 11 r tidyverse tibble

I have a toy example. What is the most efficient way to sum every two consecutive rows of y, grouped by x?


library(tibble)
l = list(x = c("a", "b", "a", "b", "a", "b"), y = c(1, 4, 3, 3, 7, 0))

df <- as_tibble(l)
df
#> # A tibble: 6 x 2
#>       x     y
#>   <chr> <dbl>
#> 1     a     1
#> 2     b     4
#> 3     a     3
#> 4     b     3
#> 5     a     7
#> 6     b     0

So the output would look like this:

   group   sum  seq
     a      4     1
     a     10     2
     b      7     1
     b      3     2

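I would like to use the tidyverse, and possibly roll_sum() from the RcppRoll package, and to write the code so that a variable number of consecutive rows can be summed on real-world data, where there will be many groups.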

TIA

Psi*_*dom 7

One approach is to use group_by %>% do, where you can customize the data frame returned inside do:

library(RcppRoll); library(tidyverse)

n = 2
df %>% 
    group_by(x) %>% 
    do(
        data.frame(
            sum = roll_sum(.$y, n), 
            seq = seq_len(length(.$y) - n + 1)
        )
    )

# A tibble: 4 x 3
# Groups:   x [2]
#      x   sum   seq
#  <chr> <dbl> <int>
#1     a     4     1
#2     a    10     2
#3     b     7     1
#4     b     3     2
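For readers who want to see what the windowed sum computes without RcppRoll, here is a minimal base R sketch using differences of cumulative sums. It is an illustration only, not the method used in this answer; `roll_sum_base` is a hypothetical helper name, and the data are the toy values from the question.

```r
# Rolling sum over windows of n consecutive values, computed as
# differences of cumulative sums (base R only; hypothetical helper).
roll_sum_base <- function(y, n) {
  if (length(y) < n) return(numeric(0))  # no complete window available
  cs <- c(0, cumsum(y))
  cs[(n + 1):length(cs)] - cs[1:(length(cs) - n)]
}

x <- c("a", "b", "a", "b", "a", "b")
y <- c(1, 4, 3, 3, 7, 0)
n <- 2

# One vector of rolling sums per group
sums <- lapply(split(y, x), roll_sum_base, n = n)
k <- unname(lengths(sums))  # number of complete windows in each group

result <- data.frame(
  group = rep(names(sums), k),
  sum   = unlist(sums, use.names = FALSE),
  seq   = sequence(k)
)
result
```

Each group of length m yields m - n + 1 sums, which is the same count that seq_len(length(.$y) - n + 1) produces above.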

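Edit: Since this is not that efficient, probably because of the overhead of constructing data frames and binding them on the fly, here is an improved version (still somewhat slower than data.table, but no longer by much):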

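df %>% 
    group_by(x) %>% 
    # store the rolling sums and their indices as list-columns, one row per group
    summarise(sum = list(roll_sum(y, n)), seq = list(seq_len(n() - n + 1))) %>%
    # expand both list-columns back into rows (tidyr >= 1.0 expects explicit cols)
    unnest(cols = c(sum, seq))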

Timings, using @Matt's data and setup:

library(tibble)
library(dplyr)
library(RcppRoll)
library(stringi) ## Only included for ability to generate random strings

## Generate data with arbitrary number of groups and rows --------------
rowCount   <- 100000
groupCount <- 10000
sumRows    <- 2L
set.seed(1)

l <- tibble(x = sample(stri_rand_strings(groupCount,3),rowCount,rep=TRUE),
            y = sample(0:10,rowCount,rep=TRUE))

## Using dplyr and tibble -----------------------------------------------

ptm <- proc.time() ## Start the clock

dplyr_result <- l %>% 
    group_by(x) %>% 
    ## use sumRows (defined above) as the window size
    summarise(sum = list(roll_sum(y, sumRows)), seq = list(seq_len(n() - sumRows + 1))) %>%
    tidyr::unnest(cols = c(sum, seq))


dplyr_time <- proc.time() - ptm ## Stop the clock

## Using data.table instead ----------------------------------------------

library(data.table)

ptm <- proc.time() ## Start the clock

setDT(l) ## Convert l to a data.table
dt_result <- l[,.(sum = RcppRoll::roll_sum(y, n = sumRows, fill = NA, align = "left"),
                  seq = seq_len(.N)),
               keyby = .(x)][!is.na(sum)]

data.table_time <- proc.time() - ptm

The results are:

dplyr_time
#   user  system elapsed 
#  0.688   0.003   0.689 
data.table_time
#   user  system elapsed 
#  0.422   0.009   0.430 


jaz*_*rro 6

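Here is one approach for you. Since you want to sum two consecutive rows, you can use lead() to compute sum. As for seq, judging by your expected result, I think you can simply take the row number. Once these operations are done, arrange the data by x (and by seq as well, if necessary). Finally, drop the rows containing NA. If necessary, you may want to remove y by adding select(-y) at the end of the code.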

group_by(df, x) %>%
mutate(sum = y + lead(y),
       seq = row_number()) %>%
arrange(x) %>%
ungroup %>%
filter(complete.cases(.))

#      x     y   sum   seq
#  <chr> <dbl> <dbl> <int>
#1     a     1     4     1
#2     a     3    10     2
#3     b     4     7     1
#4     b     3     3     2
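As a rough sketch that is not part of the answer above, the lead() idea can be extended from two rows to a window of n rows by adding n shifted copies of y. Base R is used here instead of dplyr, and the helper names `shift_lead` and `roll_sum_lead` are hypothetical.

```r
# Shift a vector forward by k positions, padding the tail with NA
# (a base R stand-in for dplyr::lead(y, k); hypothetical helper).
shift_lead <- function(y, k) tail(c(y, rep(NA, k)), length(y))

# Sum of y and its next n - 1 values; positions without a full window are NA.
roll_sum_lead <- function(y, n) {
  Reduce(`+`, lapply(0:(n - 1), function(k) shift_lead(y, k)))
}

y_a <- c(1, 3, 7)                    # the y values of group "a" in the toy data
two_rows   <- roll_sum_lead(y_a, 2)  # same as y + lead(y)
three_rows <- roll_sum_lead(y_a, 3)
two_rows
three_rows
```

The trailing NA values mark incomplete windows, so they can be dropped afterwards in the same way as filter(complete.cases(.)) does above.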


Mat*_*ill 5

I noticed you asked for the most efficient way. If you are looking at scaling this to a much larger data set, I would strongly recommend data.table.

library(data.table)
library(RcppRoll)

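setDT(l) ## Convert l to a data.table by reference

## fill = NA pads each group's result to length .N, so seq lines up;
## the padded rows are then dropped with !is.na(sum)
l[, .(sum = RcppRoll::roll_sum(y, n = 2L, fill = NA, align = "left"),
      seq = seq_len(.N)),
  keyby = .(x)][!is.na(sum)]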

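A rough benchmark against an answer using the tidyverse packages, with 100,000 rows and 10,000 groups, illustrates the significant difference.

(I used Psidom's answer rather than jazzurro's, since jazzurro's does not allow an arbitrary number of rows to be summed.)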

library(tibble)
library(dplyr)
library(RcppRoll)
library(stringi) ## Only included for ability to generate random strings

## Generate data with arbitrary number of groups and rows --------------
rowCount   <- 100000
groupCount <- 10000
sumRows    <- 2L
set.seed(1)

l <- tibble(x = sample(stri_rand_strings(groupCount,3),rowCount,rep=TRUE),
            y = sample(0:10,rowCount,rep=TRUE))

## Using dplyr and tibble -----------------------------------------------

ptm <- proc.time() ## Start the clock

dplyr_result <- l %>% 
    group_by(x) %>% 
    do(
        data.frame(
            sum = roll_sum(.$y, sumRows), 
            seq = seq_len(length(.$y) - sumRows + 1)
        )
    )

dplyr_time <- proc.time() - ptm ## Stop the clock

## Using data.table instead ----------------------------------------------

library(data.table)

ptm <- proc.time() ## Start the clock

setDT(l) ## Convert l to a data.table
dt_result <- l[,.(sum = RcppRoll::roll_sum(y, n = sumRows, fill = NA, align = "left"),
                  seq = seq_len(.N)),
               keyby = .(x)][!is.na(sum)]

data.table_time <- proc.time() - ptm ## Stop the clock

Results:

> dplyr_time
  user  system elapsed 
  10.28    0.04   10.36 
> data.table_time
   user  system elapsed 
   0.35    0.02    0.36 

> all.equal(dplyr_result,as.tibble(dt_result))
[1] TRUE