I'm trying to convert the following R data.frame:
structure(list( Time=c("09:30:01" ,"09:30:29" ,"09:35:56", "09:37:17" ,"09:37:21" ,"09:37:28" ,"09:37:35" ,"09:37:51" ,"09:42:11" ,"10:00:31"),
Price=c(1,2,3,4,5,6,7,8,9,10),
Volume=c(100,200,300,100,200,300,100,200,600,100)),
.Names = c("Time", "Price", "Volume"),
row.names = c(NA,10L),
class = "data.frame")
       Time Price Volume
1  09:30:01     1    100
2  09:30:29     2    200
3  09:35:56     3    300
4  09:37:17     4    100
5  09:37:21     5    200
6  09:37:28     6    300
7  09:37:35     7    100
8  09:37:51     8    200
9  09:42:11     9    600
10 10:00:31    10    100
into this:
       Time Price Volume Bin
1  09:30:01     1    100   1
2  09:30:29     2    200   1
3  09:35:56     3    200   1
4  09:35:56     3    100   2
5  09:37:17     4    100   2
6  09:37:21     5    200   2
7  09:37:28     6    100   2
8  09:37:28     6    200   3
9  09:37:35     7    100   3
10 09:37:51     8    200   3
11 09:42:11     9    500   4
12 09:42:11     9    100   5
13 10:00:31    10    100   5
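Essentially, it computes the cumulative sum of Volume and starts a new bin every time the running total crosses 500. So bin 1 is 100 + 200 + 200: the volume at 09:35:56 is split into 200/100, a new row is inserted, and the bin counter is incremented.
This is relatively straightforward in base R, but I was wondering whether there is a more elegant, and hopefully faster, way to do it with dplyr.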
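For reference, the splitting rule described above can be sketched as a plain base R loop. This is only an illustrative sketch, not the code that was actually benchmarked; the function name `bin_by_volume` and its `size` argument are hypothetical, and it assumes Volume holds positive integers.

```r
dt <- data.frame(
  Time   = c("09:30:01", "09:30:29", "09:35:56", "09:37:17", "09:37:21",
             "09:37:28", "09:37:35", "09:37:51", "09:42:11", "10:00:31"),
  Price  = 1:10,
  Volume = c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: walk the rows in order, fill the current bin up to
# `size`, and split a row whenever it straddles a bin boundary.
bin_by_volume <- function(df, size = 500) {
  out  <- list()
  bin  <- 1
  room <- size                      # volume still free in the current bin
  for (i in seq_len(nrow(df))) {
    left <- df$Volume[i]            # part of this row not yet assigned
    while (left > 0) {
      take <- min(left, room)
      out[[length(out) + 1]] <- data.frame(
        Time = df$Time[i], Price = df$Price[i],
        Volume = take, Bin = bin, stringsAsFactors = FALSE
      )
      left <- left - take
      room <- room - take
      if (room == 0) {              # bin is full: open the next one
        bin  <- bin + 1
        room <- size
      }
    }
  }
  do.call(rbind, out)
}

res <- bin_by_volume(dt)
```

On the sample data this gives 13 rows with bins 1 to 5. Growing a list row by row is slow for large inputs, which is part of why a vectorised version is attractive.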
Cheers
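Update:
Thanks @Frank and @AntoniosK.
To answer your questions: the Volume values are all positive integers, ranging from 1 to 10k.
I microbenchmarked both approaches, and dplyr was slightly faster on a dataset of ~200k rows similar to the one above, but not by much.
I really appreciate the quick responses and the help.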
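Not sure if this is the best or the fastest approach, but it seems to be quick for these Volume values. The philosophy is simple. For each Time and Price, create as many rows as the value of Volume, each with Volume = 1. Then take the cumsum of those values and flag every point where a new batch of 500 starts. Use these flags to create your Bin values.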
structure(list( Time=c("09:30:01" ,"09:30:29" ,"09:35:56", "09:37:17" ,"09:37:21" ,"09:37:28" ,"09:37:35" ,"09:37:51" ,"09:42:11" ,"10:00:31"),
Price=c(1,2,3,4,5,6,7,8,9,10),
Volume=c(100,200,300,100,200,300,100,200,600,100)),
.Names = c("Time", "Price", "Volume"),
row.names = c(NA,10L),
class = "data.frame") -> dt
library(dplyr)

dt %>%
  group_by(Time, Price) %>%                         ## for each Time and Price
  do(data.frame(Volume = rep(1,.$Volume))) %>%      ## create as many rows, with Volume = 1, as the value of Volume
  ungroup() %>%                                     ## forget about the grouping
  mutate(CumSum = cumsum(Volume),                   ## cumulative sums
         flag_500 = ifelse(CumSum %in% seq(501,sum(dt$Volume),by=500),1,0),   ## flag 500 batches (at 501, 1001, etc.)
         Bin = cumsum(flag_500)+1) %>%              ## create Bin values
  group_by(Bin, Time, Price) %>%                    ## for each Bin, Time and Price
  summarise(Volume = sum(Volume)) %>%               ## get new Volume values
  select(Time, Price, Volume, Bin) %>%              ## use only if you want to re-arrange column order
  ungroup()                                         ## use if you want to forget the grouping
#        Time Price Volume   Bin
#       (chr) (dbl)  (dbl) (dbl)
# 1  09:30:01     1    100     1
# 2  09:30:29     2    200     1
# 3  09:35:56     3    200     1
# 4  09:35:56     3    100     2
# 5  09:37:17     4    100     2
# 6  09:37:21     5    200     2
# 7  09:37:28     6    100     2
# 8  09:37:28     6    200     3
# 9  09:37:35     7    100     3
# 10 09:37:51     8    200     3
# 11 09:42:11     9    500     4
# 12 09:42:11     9    100     5
# 13 10:00:31    10    100     5
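Expanding every row into Volume unit rows can produce a very large intermediate table when volumes run into the thousands. A possible alternative, sketched below in base R rather than dplyr, works directly on the cumulative totals: every output row ends either where an input row ends or at a multiple of 500. The function name `bin_by_volume_vec` and the `size` argument are hypothetical, and the sketch assumes Volume contains positive integers and that the rows are already in time order.

```r
dt <- data.frame(
  Time   = c("09:30:01", "09:30:29", "09:35:56", "09:37:17", "09:37:21",
             "09:37:28", "09:37:35", "09:37:51", "09:42:11", "10:00:31"),
  Price  = 1:10,
  Volume = c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split rows at bin boundaries without unit-row expansion.
bin_by_volume_vec <- function(df, size = 500) {
  end   <- cumsum(df$Volume)                 # running total after each input row
  total <- end[length(end)]
  # Levels at which an output row ends: input row ends plus multiples of `size`
  cuts  <- sort(unique(c(end, size * seq_len(total %/% size))))
  start <- c(0, cuts[-length(cuts)])         # level at which each output row starts
  src   <- findInterval(start, end) + 1      # input row each output row comes from
  data.frame(
    Time   = df$Time[src],
    Price  = df$Price[src],
    Volume = cuts - start,
    Bin    = start %/% size + 1,             # bin is determined by the start level
    stringsAsFactors = FALSE
  )
}

res_vec <- bin_by_volume_vec(dt)
```

On the sample data this yields the same 13 rows as the output above. The cost grows with the number of input rows plus the number of bins rather than with the total volume, though it has not been benchmarked here.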