Pha*_*ann 5 grouping r intervals data.table
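I am using R with the data.table package, and I would like to group a data.table by running (time) intervals, i.e. overlapping bins. For each of these running intervals I want to count the occurrences of equal data pairs. In addition, these "equal data pairs" need not be exactly equal, only equal within a certain range.
A simple version of the problem looks like this: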
#Time X Y Counts
# ... ... ... 1
#I would like to do:
DT[, sum(Counts), by = list(Time, X, Y)]
#with Time, X and Y being in overlapping intervals.
findInterval() would give me bins with "hard" boundaries rather than overlapping ones.
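To illustrate the "hard boundary" behaviour, here is a minimal sketch in base R. The break points `c(0, 5, 10)` are an arbitrary choice made only for this illustration and are not part of the question; the `Time` values are those of the example data.

```r
# Time values from the example data
Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)

# Hypothetical break points, chosen only for this illustration
breaks <- c(0, 5, 10)

# findInterval() assigns every value to exactly one bin [breaks[k], breaks[k + 1])
bins <- findInterval(Time, breaks)

# One bin index per value: neighbouring values such as 4 and 5 end up in different bins
data.frame(Time, bins)
```

Because each value receives a single bin index, a value close to a boundary never "sees" its neighbours on the other side, which is what overlapping windows are meant to avoid.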
The problem in more detail: suppose I have a data.table like this:
library(data.table)
Time <- c(1,1,2,4,5,5,6,7,8,8,8,8,12,13)
#more equal time values are allowed.
X <- c(6,6,7,10,5,7,6,3,9,10,6,3,3,6)
Y <- c(2,6,10,3,4,6,6,9,4,9,6,6,9,9)
DT <- data.table(Time, X, Y)
Time X Y
1: 1 6 2
2: 1 6 6
3: 2 7 10
4: 4 10 3
5: 5 5 4
6: 5 7 6
7: 6 6 6
8: 7 3 9
9: 8 9 4
10: 8 10 9
11: 8 6 6
12: 8 3 6
13: 12 3 9
14: 13 6 9
And some predefined interval sizes:
Timeinterval <- 5
#for a time value of 10 this means to look from 10-5 to 10+5
RangeX.percentage <- 0.5
RangeY.percentage <- 0.5
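The result should give me an additional column, let's call it "count", holding the number of co-occurring equal data pairs X and Y, taking the ranges for X and Y into account.
I thought about grouping by time intervals: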
c(1, 1, 2, 4, 5, 5, 6) #for the first item: (1-5):(1+5)
c(1, 1, 2, 4, 5, 5, 6, 7) # for the third item: (2-5):(2+5)
c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8) # for the fourth item: (4-5):(4+5)
#...
c(8, 8, 8, 8, 12, 13) # for the last item (13-5):(13+5)
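The windows listed above can be reproduced with a short base R sketch. It assumes an inclusive comparison (`<=`), which is what the listing implies; note that the loop later in the question uses a strict `<` instead. The name `windows` is made up for this illustration.

```r
# Time values and window half-width from the question
Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)
Timeinterval <- 5

# For every row, collect all Time values within +/- Timeinterval (inclusive bounds)
windows <- lapply(Time, function(t) Time[abs(Time - t) <= Timeinterval])

windows[[1]]   # window around the first item  (Time = 1)
windows[[14]]  # window around the last item   (Time = 13)
```

This scans the full `Time` vector once per row, so it only serves to make the window definition concrete and is not meant as a solution for a million rows.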
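as well as applying the following conditions to the data (though there may be a simpler version of this part, too):
Edit: to clarify what the result should look like: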
Ranges <- DT[ , list(
X* (1 + RangeX.percentage), X* (1 - RangeX.percentage),
Y* (1 + RangeY.percentage), Y* (1 - RangeY.percentage))]
DT2 <- cbind(DT, Ranges, count = rep(1, nrow(DT)))
setnames(DT2, c("Time","X","Y","XR1","XR2","YR1","YR2","count"))
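for (i in seq_len(nrow(DT2))) {
  # main part of the question: how to get this done within data.table?
  DT2.subset <- DT2[which(abs(Time - DT2[i]$Time) < Timeinterval)]
  # subsequent comparison of X and Y (XR1/YR1 are upper bounds, XR2/YR2 lower bounds):
  n.matches <- sum(DT2.subset$X < DT2[i]$XR1 &
                   DT2.subset$X > DT2[i]$XR2 &
                   DT2.subset$Y < DT2[i]$YR1 &
                   DT2.subset$Y > DT2[i]$YR2)
  # store the result in the count column of DT2 (not DT), by reference
  set(DT2, i = i, j = "count", value = as.numeric(n.matches))
}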
DT2
Time X Y XR1 XR2 YR1 YR2 count
1: 1 6 2 9.0 3.0 3.0 1.0 1
2: 1 6 6 9.0 3.0 9.0 3.0 3
3: 2 7 10 10.5 3.5 15.0 5.0 4
4: 4 10 3 15.0 5.0 4.5 1.5 3
5: 5 5 4 7.5 2.5 6.0 2.0 1
6: 5 7 6 10.5 3.5 9.0 3.0 6
7: 6 6 6 9.0 3.0 9.0 3.0 4
8: 7 3 9 4.5 1.5 13.5 4.5 2
9: 8 9 4 13.5 4.5 6.0 2.0 3
10: 8 10 9 15.0 5.0 13.5 4.5 4
11: 8 6 6 9.0 3.0 9.0 3.0 4
12: 8 3 6 4.5 1.5 9.0 3.0 1
13: 12 3 9 4.5 1.5 13.5 4.5 2
14: 13 6 9 9.0 3.0 13.5 4.5 1
Since my full data.table has more than a million rows, checking all of DT$Time for every single row is far too expensive in computation time.
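You could try data.table::foverlaps. We create Ranges as you did, but also add the Time range and a row index (for the later aggregation). The main complication is that you want strict < and > rather than <= and >=, whereas foverlaps treats interval end points as inclusive, so we have to shrink the Time interval by 1 on each side (this works because the Time values here are integers). We then add a second Time column to DT as well, set the key, and run foverlaps. The final step is to count the matching observations per row.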
DT[, Time2 := Time] ## Add higher interval to DT
setkey(DT, Time, Time2) ## key (for foverlaps)
Ranges <-
DT[ , .(Time = Time - Timeinterval + 1, ## Add lower Time interval
Time2 = Time + Timeinterval - 1, ## Add higher Time interval
XR1 = X* (1 - RangeX.percentage),
XR2 = X* (1 + RangeX.percentage),
YR1 = Y* (1 - RangeY.percentage),
YR2 = Y* (1 + RangeY.percentage),
indx = .I)] ## Add row index
# Run foverlaps and count incidences by condition while updating DT by reference
DT[,
count := foverlaps(Ranges, DT)[X > XR1 & X < XR2 & Y > YR1 & Y < YR2,
.N,
keyby = indx]$N]
DT
# Time X Y Time2 count
# 1: 1 6 2 1 1
# 2: 1 6 6 1 3
# 3: 2 7 10 2 4
# 4: 4 10 3 4 3
# 5: 5 5 4 5 1
# 6: 5 7 6 5 6
# 7: 6 6 6 6 4
# 8: 7 3 9 7 2
# 9: 8 9 4 8 3
# 10: 8 10 9 8 4
# 11: 8 6 6 8 4
# 12: 8 3 6 8 1
# 13: 12 3 9 12 2
# 14: 13 6 9 13 1
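As an alternative that is not part of the answer above, the strict inequalities could also be expressed directly with a non-equi join. This is a minimal sketch, assuming data.table 1.9.8 or later (where non-equi joins were introduced). The names `Windows`, `tlo`, `thi`, `xlo`, `xhi`, `ylo` and `yhi` are made up for this illustration.

```r
library(data.table)

# Example data and interval sizes from the question
Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)
X    <- c(6, 6, 7, 10, 5, 7, 6, 3, 9, 10, 6, 3, 3, 6)
Y    <- c(2, 6, 10, 3, 4, 6, 6, 9, 4, 9, 6, 6, 9, 9)
DT   <- data.table(Time, X, Y)
Timeinterval      <- 5
RangeX.percentage <- 0.5
RangeY.percentage <- 0.5

# One row of open-interval bounds per row of DT (hypothetical column names)
Windows <- DT[, .(tlo = Time - Timeinterval, thi = Time + Timeinterval,
                  xlo = X * (1 - RangeX.percentage), xhi = X * (1 + RangeX.percentage),
                  ylo = Y * (1 - RangeY.percentage), yhi = Y * (1 + RangeY.percentage))]

# For each row of Windows, count the rows of DT that fall strictly inside all three ranges;
# by = .EACHI returns one result per row of Windows, in the same order
counts <- DT[Windows,
             on = .(Time > tlo, Time < thi, X > xlo, X < xhi, Y > ylo, Y < yhi),
             .N, by = .EACHI]$N

DT[, count := counts]
DT
```

With strict comparisons written out in `on`, no +-1 adjustment is needed, so this variant should also work for non-integer Time values. How it compares with foverlaps in speed and memory on a million rows would have to be measured.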