R data.table - compute a value for each row using all rows before the current row

tuc*_*son 11 r data.table

I would like to compute various things by id and in order (time). For example, with:

dt = data.table( id=c(1,1,1,2,2,2,3,3,3), hour=c(1,5,5,6,7,8,23,23,23), ip=c(1,1,45,2,2,2,3,1,1), target=c(1,0,0,1,1,1,1,1,0), day=c(1,1,1,1,1,1,3,2,1))

   id hour ip target day
1:  1    1  1      1   1
2:  1    5  1      0   1
3:  1    5 45      0   1
4:  2    6  2      1   1
5:  2    7  2      1   1
6:  2    8  2      1   1
7:  3   23  3      1   3
8:  3   23  1      1   2
9:  3   23  1      0   1

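I would like to compute, for each id, the number of active days and active hours so far, for each row. This means I would like to get the following output: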

   id hour ip target day  nb_active_hours_so_far
1:  1    1  1      1   1  0  (first occurrence of id when ordered by hour)
2:  1    5  1      0   1  1  (has been active in hour "1")
3:  1    5 45      0   1  2  (has been active in hour "1" and "5")
4:  2    6  2      1   1  0  (first occurrence)
5:  2    7  2      1   1  1  (has been active in hour "6")
6:  2    8  2      1   1  2  (has been active in hour "6" and "7")
7:  3   23  3      1   3  0  (first occurrence)
8:  3   23  1      1   2  1  (has been active in hour "23")
9:  3   23  1      0   1  1  (has been active in hour "23" only)

To get the total number of active hours per id, I would do:

dt[, nb_active_hours := length(unique(hour)), by=id]

But I want the "so far" part, and I don't know how to do that. Any help would be appreciated.

Dav*_*urg 7

This seems to work (though I haven't tested it on different cases):

dt[, nb_active_hours_so_far := cumsum(c(0:1, diff(hour[-.N]))>0), by = id]
#    id hour ip target day nb_active_hours_so_far
# 1:  1    1  1      1   1                      0
# 2:  1    5  1      0   1                      1
# 3:  1    5 45      0   1                      2
# 4:  2    6  2      1   1                      0
# 5:  2    7  2      1   1                      1
# 6:  2    8  2      1   1                      2
# 7:  3   23  3      1   3                      0
# 8:  3   23  1      1   2                      1
# 9:  3   23  1      0   1                      1
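
To see why the one-liner gives these numbers, it can help to unpack it on a single group. The sketch below uses base R only and takes the `hour` values of id 1 from the question; `n` is a stand-in for data.table's `.N`, and the intermediate variable names are made up for illustration.

```r
# Hours for id 1 from the question; n plays the role of .N inside data.table
hour <- c(1, 5, 5)
n <- length(hour)

prev <- hour[-n]              # drop the last row's hour: 1 5
steps <- c(0:1, diff(prev))   # 0 for the first row, 1 for the second, then the hour changes: 0 1 4
changed <- steps > 0          # FALSE TRUE TRUE
result <- cumsum(changed)     # running count of TRUE values: 0 1 2
print(result)
```

Because `c(0:1, ...)` always contributes two elements, the vector has length 2 even for an id with a single row, so that case may need separate handling.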


Col*_*vel 7

Yuck. I have this ugly solution:

library(data.table)
dt[ ,nb_active_hours_so_far:=c(0,head(cumsum(c(1,diff(hour)>0)), -1)),id][]

#   id hour ip target day nb_active_hours_so_far
#1:  1    1  1      1   1                      0
#2:  1    5  1      0   1                      1
#3:  1    5 45      0   1                      2
#4:  2    6  2      1   1                      0
#5:  2    7  2      1   1                      1
#6:  2    8  2      1   1                      2
#7:  3   23  3      1   3                      0
#8:  3   23  1      1   2                      1
#9:  3   23  1      0   1                      1


akr*_*run 7

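Alternatively, you can use the rleid/shift functions from the devel version of data.table, i.e. v1.9.5. Instructions for installing the devel version are here. (Thanks to @Frank for shift.)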

 library(data.table)
 dt[,nb_active_hours_so_far := shift(rleid(hour),fill=0L), id]
 #   id hour ip target day nb_active_hours_so_far
 #1:  1    1  1      1   1                      0
 #2:  1    5  1      0   1                      1
 #3:  1    5 45      0   1                      2
 #4:  2    6  2      1   1                      0
 #5:  2    7  2      1   1                      1
 #6:  2    8  2      1   1                      2
 #7:  3   23  3      1   3                      0
 #8:  3   23  1      1   2                      1
 #9:  3   23  1      0   1                      1
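
As a rough illustration of the two building blocks, here they are applied to one group's `hour` vector outside of any data.table. This assumes an installed data.table version that ships `rleid` and `shift`; the variable names `runs` and `lagged` are just for this sketch.

```r
library(data.table)

hour <- c(1, 5, 5)                 # hours for id 1 in the question

runs <- rleid(hour)                # run id, increases whenever the value changes: 1 2 2
lagged <- shift(runs, fill = 0L)   # lag by one row and pad the first row with 0: 0 1 2
print(lagged)
```

Lagging the run id by one row is what excludes the current row from its own count.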
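
One caveat that applies to all three answers: they compare neighbouring rows, so they count changes in `hour` rather than distinct hours, which is only the same thing when `hour` is sorted within each `id`. If that cannot be guaranteed, a possible alternative is to count distinct earlier values with `duplicated()`. The helper below is a hypothetical sketch in base R (the name `distinct_so_far` is not from any of the answers).

```r
# Hypothetical helper (not from the answers): for each position, the number of
# distinct values seen strictly before it. Does not require sorted input.
distinct_so_far <- function(x) {
  seen <- cumsum(!duplicated(x))   # distinct values up to and including each row
  c(0L, head(seen, -1L))           # lag by one so the current row is excluded
}

print(distinct_so_far(c(1, 5, 5)))      # hours of id 1 in the question
print(distinct_so_far(c(23, 23, 23)))   # hours of id 3 in the question
print(distinct_so_far(c(5, 1, 5, 5)))   # unsorted input
```

Inside data.table it could presumably be used as `dt[, nb_active_hours_so_far := distinct_so_far(hour), by = id]`.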