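Basically, I have the following two data.tables:
dt - contains a value field (y) and a grouping field (x)
bk - contains 4 "break" fields (bn) that describe the bucket structure over the interval [1, inf] for each group x found in dt. Each bn is the minimum value (inclusive) of a bucket, which extends up to the next break (e.g. the 4 buckets for x = 1 are [1,3), [3,5), [5,10), [10,inf)). Note that the bucket structures are not necessarily unique.
> # 4 groups (x), each with a bucket structure defined by breaks (bn).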
> bk<- data.table(x=c(1:4), b1=c(1,1,1,1), b2=c(3,3,4,4), b3=c(5,5,7,8), b4=c(10,10,10,10), key="x")
> bk
x b1 b2 b3 b4
1: 1 1 3 5 10
2: 2 1 3 5 10
3: 3 1 4 7 10
4: 4 1 4 8 10
> dt<- data.table(x=rep(c(1:4),5), y=rep(c(1:10),2), key="x")
> dt
x y
1: 1 1
2: 1 5
3: 1 9
4: 1 3
5: 1 7
6: 2 2
7: 2 6
8: 2 10
9: 2 4
10: 2 8
11: 3 3
12: 3 7
13: 3 1
14: 3 5
15: 3 9
16: 4 4
17: 4 8
18: 4 2
19: 4 6
20: 4 10
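My goal is to add a field b to dt indicating which bucket (1, 2, 3, or 4) each record falls into, based on the bucket structure for its group x. See the desired output below: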
x y b
1: 1 1 1 #Buckets for x=1
2: 1 5 3
3: 1 9 3
4: 1 3 2
5: 1 7 3
6: 2 2 1 #Buckets for x=2 (same as 1)
7: 2 6 3
8: 2 10 4
9: 2 4 2
10: 2 8 3
11: 3 3 1 #Buckets for x=3
12: 3 7 3
13: 3 1 1
14: 3 5 2
15: 3 9 3
16: 4 4 2 #Buckets for x=4
17: 4 8 3
18: 4 2 1
19: 4 6 2
20: 4 10 4
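My initial idea was to join the two data.tables and use the cut function to return the bucket number for each record, but I am running into a problem with the breaks argument. My first attempt is below: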
> bk[dt, .(x, y, b=cut(y, breaks=c(b1, b2, b3, b4, "inf"), include.lowest=TRUE, labels=c(1:4)))]
Error in cut.default(y, breaks = c(b1, b2, b3, b4, "inf"), include.lowest = TRUE, :
'breaks' are not unique
If I create a variable a to hold the bucket structure (e.g. the one for x = 1), the following works as I expected:
> a<- c(1, 3, 5, 10, "inf")
> bk[dt, .(x, y, b=cut(y, breaks=a, include.lowest=TRUE, labels=c(1:4)))]
x y b
1: 1 1 1
2: 1 5 2
3: 1 9 3
4: 1 3 1
5: 1 7 3
6: 2 2 1
7: 2 6 3
8: 2 10 3
9: 2 4 2
10: 2 8 3
11: 3 3 1
12: 3 7 3
13: 3 1 1
14: 3 5 2
15: 3 9 3
16: 4 4 2
17: 4 8 3
18: 4 2 1
19: 4 6 3
20: 4 10 3
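This is still not a practical solution for my application, but I am hoping someone can help me understand how to correctly pass the bucket structure information to the breaks argument to get a similar result. I have tried various combinations of the c, list, unlist, and as.numeric functions to pass the correct breaks argument, but with no luck. Any help/insight would be much appreciated. Thanks!
Full disclosure: I am new to R and this is my first post, so please be gentle.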
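A note on why the first attempt fails, offered as a sketch rather than a definitive fix: inside bk[dt, ...] the j expression is evaluated once over all joined rows, so b1 through b4 are whole columns and c(b1, b2, b3, b4) repeats every break, which cut() rejects. One way around this is to call cut() once per group. The helper name breaks_for is hypothetical (introduced here for illustration), and right = FALSE is an assumption made so that each bucket includes its minimum, as described in the question.

```r
library(data.table)

bk <- data.table(x = 1:4, b1 = c(1, 1, 1, 1), b2 = c(3, 3, 4, 4),
                 b3 = c(5, 5, 7, 8), b4 = c(10, 10, 10, 10), key = "x")
dt <- data.table(x = rep(1:4, 5), y = rep(1:10, 2), key = "x")

# After the join each bn is a column with one value per row of dt,
# so concatenating them yields many repeated breaks.
joined <- bk[dt, on = "x"]
has_dups <- anyDuplicated(c(joined$b1, joined$b2, joined$b3, joined$b4)) > 0

# Hypothetical helper: the four breaks of group g as a plain numeric vector.
breaks_for <- function(g) {
  unlist(bk[.(g), on = "x", .(b1, b2, b3, b4)], use.names = FALSE)
}

# Call cut() once per group. right = FALSE gives intervals [min, next),
# and labels = FALSE returns the integer bucket number.
dt[, b := cut(y, breaks = c(breaks_for(.BY$x), Inf),
              right = FALSE, labels = FALSE),
   by = x]
```

This stays close to the original cut() idea, but it looks up the breaks once per group, so it may be slower than a join-based approach when there are many groups.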
You can use melt.data.table to reshape the bk dataset into a simpler, long form:
bk_long <- melt.data.table(
bk,
id.vars = 'x',
measure.vars = paste0('b', 1:4),
value.name = 'y'
)
setkey(bk_long, x)
bk_long[, variable := NULL]
bk_long[, b := seq_len(.N), by = x]
bk_long
# x y b
# 1: 1 1 1
# 2: 1 3 2
# 3: 1 5 3
# 4: 1 10 4
# 5: 2 1 1
# 6: 2 3 2
# 7: 2 5 3
# 8: 2 10 4
# 9: 3 1 1
# 10: 3 4 2
# 11: 3 7 3
# 12: 3 10 4
# 13: 4 1 1
# 14: 4 4 2
# 15: 4 8 3
# 16: 4 10 4
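As a variation on the reshaping step above, the bucket number can be taken from the column name (b1 to b4) instead of from row order, which avoids relying on how the melted rows happen to be sorted. This is only a sketch: it calls the melt() generic, which dispatches to the data.table method, and it assumes the break columns are named "b" followed by the bucket number.

```r
library(data.table)

bk <- data.table(x = 1:4, b1 = c(1, 1, 1, 1), b2 = c(3, 3, 4, 4),
                 b3 = c(5, 5, 7, 8), b4 = c(10, 10, 10, 10), key = "x")

# Long form: one row per (group, break); `variable` holds "b1".."b4".
bk_long <- melt(bk, id.vars = "x", measure.vars = paste0("b", 1:4),
                value.name = "y")

# Derive the bucket number from the column name, then drop the name.
bk_long[, b := as.integer(sub("^b", "", as.character(variable)))]
bk_long[, variable := NULL]

# Sort by group and break value.
setkey(bk_long, x, y)
```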
Then do a rolling join, as Frank suggested:
bk_long[dt, on = c('x', 'y'), roll = TRUE]
# x y b
# 1: 1 1 1
# 2: 1 5 3
# 3: 1 9 3
# 4: 1 3 2
# 5: 1 7 3
# 6: 2 2 1
# 7: 2 6 3
# 8: 2 10 4
# 9: 2 4 2
# 10: 2 8 3
# 11: 3 3 1
# 12: 3 7 3
# 13: 3 1 1
# 14: 3 5 2
# 15: 3 9 3
# 16: 4 4 2
# 17: 4 8 3
# 18: 4 2 1
# 19: 4 6 2
# 20: 4 10 4
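The join above returns a new table. Since the goal was to add b to dt itself, the same rolling join can be used to assign the column by reference. The following is a minimal sketch under the same data as the question. With roll = TRUE each y picks up the last break at or below it within its group x, so a y below the group's first break would get NA.

```r
library(data.table)

bk <- data.table(x = 1:4, b1 = c(1, 1, 1, 1), b2 = c(3, 3, 4, 4),
                 b3 = c(5, 5, 7, 8), b4 = c(10, 10, 10, 10), key = "x")
dt <- data.table(x = rep(1:4, 5), y = rep(1:10, 2), key = "x")

# Long form of the breaks, sorted by group and break value.
bk_long <- melt(bk, id.vars = "x", measure.vars = paste0("b", 1:4),
                value.name = "y")
setkey(bk_long, x, y)
bk_long[, b := seq_len(.N), by = x]
bk_long[, variable := NULL]

# The join yields one bucket number per row of dt, in dt's row order;
# x.b refers to column b of bk_long. Assign it to dt by reference.
dt[, b := bk_long[dt, on = c("x", "y"), roll = TRUE, x.b]]
```

Printing dt afterwards should show the x, y and b columns from the desired output in the question.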