dplyr conditional window

Question

I am trying to transform the following R data.frame:

    structure(list( Time=c("09:30:01"  ,"09:30:29"  ,"09:35:56",  "09:37:17"  ,"09:37:21"  ,"09:37:28"  ,"09:37:35"  ,"09:37:51"  ,"09:42:11"  ,"10:00:31"),
            Price=c(1,2,3,4,5,6,7,8,9,10),
            Volume=c(100,200,300,100,200,300,100,200,600,100)),
      .Names = c("Time", "Price", "Volume"),
      row.names = c(NA,10L),
      class = "data.frame")

           Time Price Volume
    1  09:30:01     1    100
    2  09:30:29     2    200
    3  09:35:56     3    300
    4  09:37:17     4    100
    5  09:37:21     5    200
    6  09:37:28     6    300
    7  09:37:35     7    100
    8  09:37:51     8    200
    9  09:42:11     9    600
    10 10:00:31    10    100


into this:

           Time Price Volume Bin
    1  09:30:01     1    100   1
    2  09:30:29     2    200   1
    3  09:35:56     3    200   1
    4  09:35:56     3    100   2
    5  09:37:17     4    100   2
    6  09:37:21     5    200   2
    7  09:37:28     6    100   2
    8  09:37:28     6    200   3
    9  09:37:35     7    100   3
    10 09:37:51     8    200   3
    11 09:42:11     9    500   4
    12 09:42:11     9    100   5
    13 10:00:31    10    100   5


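Essentially, it computes the cumulative Volume and starts a new bin each time the running total crosses a multiple of 500. So bin 1 is 100 + 200 + 200: the Volume at 09:35:56 is split 200/100, a new row is inserted, and the bin counter is incremented.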
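The split at 09:35:56 can be checked with a little arithmetic on the running total. The sketch below assumes a bin size of 500 and uses only the first three volumes from the example; the variable names are illustrative and do not come from the original post.

```r
# First three volumes from the example; bin size of 500 as described above.
volume   <- c(100, 200, 300)
bin_size <- 500

cum_end   <- cumsum(volume)      # running total after each row:  100 300 600
cum_start <- cum_end - volume    # running total before each row:   0 100 300

# Bin that holds the first and the last unit of each row's volume.
first_bin <- cum_start %/% bin_size + 1        # 1 1 1
last_bin  <- (cum_end - 1) %/% bin_size + 1    # 1 1 2

# Row 3 spans bins 1 and 2, so it is split at the 500 boundary.
part_in_bin1 <- bin_size - cum_start[3]   # 200
part_in_bin2 <- cum_end[3] - bin_size     # 100
```

A row needs splitting exactly when `first_bin` and `last_bin` differ, which here is only the third row.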

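This is relatively straightforward in base R, but I was wondering whether there is a more elegant, and hopefully faster, way to do it with dplyr.

Cheers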

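Update:

Thanks @Frank and @AntoniosK.

To answer your questions: the Volume values are all positive integers, ranging from 1 to 10k.

I microbenchmarked both approaches, and dplyr was slightly faster on a dataset of ~200k rows similar to the one above, but not by much.

I really appreciate the quick responses and the help.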

Answer 1

Ant*_*osK 4

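Not sure whether this is the best or fastest approach, but it seems fast enough for these Volume values. The idea is simple. For each Time and Price, create as many rows as the value of Volume, each with Volume = 1. Then take the cumsum of those values and flag every point where a new batch of 500 begins. Use those flags to create your Bin values.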
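Before the full pipeline, the core idea can be seen on a toy input in base R. This is only a sketch under stated assumptions: a bin size of 500, the first three volumes from the question, and illustrative variable names (`row_id`, `running`, `starts_bin`) that are not part of the answer's code.

```r
# Toy input: the first three volumes from the question (assumed bin size 500).
volume   <- c(100, 200, 300)
bin_size <- 500

# One entry per unit of volume, remembering which original row it came from.
row_id  <- rep(seq_along(volume), times = volume)
running <- seq_along(row_id)    # cumulative sum of a vector of ones

# A new bin starts at units 501, 1001, ...; counting those starts gives the bin.
starts_bin <- running %in% seq(bin_size + 1, sum(volume), by = bin_size)
bin <- cumsum(starts_bin) + 1

# The same labels via integer division, without building the flag vector.
bin_direct <- (running - 1) %/% bin_size + 1

# Collapse back: units per (original row, bin) pair.
counts <- table(row = row_id, bin = bin)
counts
```

Row 3 contributes 200 units to bin 1 and 100 units to bin 2, matching the split in the question. The integer-division form gives the same labels and also sidesteps the error that `seq()` raises when the total volume is below 501.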

    structure(list( Time=c("09:30:01"  ,"09:30:29"  ,"09:35:56",  "09:37:17"  ,"09:37:21"  ,"09:37:28"  ,"09:37:35"  ,"09:37:51"  ,"09:42:11"  ,"10:00:31"),
                    Price=c(1,2,3,4,5,6,7,8,9,10),
                    Volume=c(100,200,300,100,200,300,100,200,600,100)),
              .Names = c("Time", "Price", "Volume"),
              row.names = c(NA,10L),
              class = "data.frame") -> dt

    library(dplyr)

    dt %>%
      group_by(Time, Price) %>%                     ## for each Time and Price
      do(data.frame(Volume = rep(1,.$Volume))) %>%  ## create as many rows, with Volume = 1, as the value of Volume
      ungroup() %>%                                 ## forget about the grouping
      mutate(CumSum = cumsum(Volume),               ## cumulative sums
             flag_500 = ifelse(CumSum %in% seq(501,sum(dt$Volume),by=500),1,0),  ## flag 500 batches (at 501, 1001, etc.)
             Bin = cumsum(flag_500)+1) %>%          ## create Bin values
      group_by(Bin, Time, Price) %>%                ## for each Bin, Time and Price
      summarise(Volume = sum(Volume)) %>%           ## get new Volume values
      select(Time, Price, Volume, Bin) %>%          ## use only if you want to re-arrange column order
      ungroup()                                     ## use if you want to forget the grouping

    #        Time Price Volume   Bin
    #       (chr) (dbl)  (dbl) (dbl)
    # 1  09:30:01     1    100     1
    # 2  09:30:29     2    200     1
    # 3  09:35:56     3    200     1
    # 4  09:35:56     3    100     2
    # 5  09:37:17     4    100     2
    # 6  09:37:21     5    200     2
    # 7  09:37:28     6    100     2
    # 8  09:37:28     6    200     3
    # 9  09:37:35     7    100     3
    # 10 09:37:51     8    200     3
    # 11 09:42:11     9    500     4
    # 12 09:42:11     9    100     5
    # 13 10:00:31    10    100     5
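The pipeline above builds one row per unit of Volume, which may get large when volumes run up to 10k. As a possible alternative, here is a base R sketch that splits rows directly at multiples of the bin size. It is not the answer author's method: `split_into_bins` is a hypothetical helper, it assumes Volume holds positive integers, and no claim is made about its speed.

```r
# Hypothetical helper (not from the original answer): split rows at multiples
# of bin_size without expanding to one row per unit of volume.
# Assumes df$Volume contains positive integers.
split_into_bins <- function(df, bin_size = 500) {
  cum_end   <- cumsum(df$Volume)       # running total after each row
  cum_start <- cum_end - df$Volume     # running total before each row

  first_bin <- cum_start %/% bin_size + 1
  last_bin  <- (cum_end - 1) %/% bin_size + 1
  n_pieces  <- last_bin - first_bin + 1

  # Repeat each row once per bin it touches.
  idx <- rep(seq_len(nrow(df)), times = n_pieces)
  # Bin number of each piece: first_bin, first_bin + 1, ..., last_bin.
  bin <- first_bin[idx] + sequence(n_pieces) - 1

  # Each piece is the overlap between the row's range of cumulative volume
  # and the bin's range ((bin - 1) * bin_size up to bin * bin_size).
  lo <- pmax(cum_start[idx], (bin - 1) * bin_size)
  hi <- pmin(cum_end[idx], bin * bin_size)

  out <- df[idx, , drop = FALSE]
  out$Volume <- hi - lo
  out$Bin <- bin
  rownames(out) <- NULL
  out
}

# Same data as in the question.
dt <- data.frame(
  Time   = c("09:30:01", "09:30:29", "09:35:56", "09:37:17", "09:37:21",
             "09:37:28", "09:37:35", "09:37:51", "09:42:11", "10:00:31"),
  Price  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Volume = c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100),
  stringsAsFactors = FALSE
)

res <- split_into_bins(dt)
res
```

On the example data this yields the same 13 rows as the expected output in the question. A row larger than the bin size is spread over as many bins as it touches, since each row is repeated once per bin between `first_bin` and `last_bin`.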

