Ben*_*Ben 10 r data.table
I'm trying to measure the empirical cumulative distribution of some data in a multivariate setting. That is, given a dataset like
library(data.table) # v 1.9.7
set.seed(2016)
dt <- data.table(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
dt
x y z
1: -0.91474 2.07025 -1.7499
2: 1.00125 -1.80941 -1.3856
3: -0.05642 1.58499 0.8110
4: 0.29665 -1.16660 0.3757
5: -2.79147 -1.75526 1.2851
---
996: 0.63423 0.13597 -2.3710
997: 0.21415 1.03161 -1.5440
998: 1.15357 -1.63713 0.4191
999: 0.79205 -0.56119 0.6670
1000: 0.19502 -0.05297 -0.3288
I want to count the number of samples such that (x <= X, y <= Y, z <= Z) for some grid of (X, Y, Z) upper bounds, like
bounds <- CJ(X=seq(-2, 2, by=.1), Y=seq(-2, 2, by=.1), Z=seq(-2, 2, by=.1))
bounds
X Y Z
1: -2 -2 -2.0
2: -2 -2 -1.9
3: -2 -2 -1.8
4: -2 -2 -1.7
5: -2 -2 -1.6
---
68917: 2 2 1.6
68918: 2 2 1.7
68919: 2 2 1.8
68920: 2 2 1.9
68921: 2 2 2.0
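To make the target quantity concrete, here is a minimal brute-force sketch for a single bound. The helper name `count_below` is hypothetical and only meant for illustration; it scans every sample once per bound, so it is a reference definition rather than a fast method.

```r
library(data.table)

set.seed(2016)
dt <- data.table(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))

# Hypothetical helper: number of samples at or below the bound (X, Y, Z)
# in every coordinate simultaneously.
count_below <- function(X, Y, Z) dt[x <= X & y <= Y & z <= Z, .N]

count_below(0, 0, 0)              # samples with x, y and z all <= 0
count_below(2, 2, 2) / nrow(dt)   # empirical CDF evaluated at (2, 2, 2)
```

Dividing the count by `nrow(dt)` turns it into the empirical CDF at that bound.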
Now, I've figured out that I can do this elegantly (using non-equi joins):
dt[, Count := 1]
# Join every bound to all samples below it, then count the matches per bound.
result <- dt[bounds, on=c("x<=X", "y<=Y", "z<=Z"), allow.cartesian=TRUE][
  , list(N.cum = sum(!is.na(Count))), keyby=list(X=x, Y=y, Z=z)]
result[, CDF := N.cum/nrow(dt)]
result
X Y Z N.cum CDF
1: -2 -2 -2.0 0 0.000
2: -2 -2 -1.9 0 0.000
3: -2 -2 -1.8 0 0.000
4: -2 -2 -1.7 0 0.000
5: -2 -2 -1.6 0 0.000
---
68917: 2 2 1.6 899 0.899
68918: 2 2 1.7 909 0.909
68919: 2 2 1.8 917 0.917
68920: 2 2 1.9 924 0.924
68921: 2 2 2.0 929 0.929
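But this method is very inefficient and becomes really slow as I start to increase the bin count. I think a multivariate version of data.table's rolling join functionality would do the trick, but as far as I know that isn't possible. Any suggestions to speed this up?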
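One way to see where the cost comes from is to look at the size of the intermediate join result. The sketch below is only an illustration: it uses a deliberately coarse 5 x 5 x 5 grid (my assumption, not the grid from the question) so that it runs quickly.

```r
library(data.table)

set.seed(2016)
dt <- data.table(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))

# Coarse grid chosen for illustration only.
bounds <- CJ(X=seq(-2, 2, by=1), Y=seq(-2, 2, by=1), Z=seq(-2, 2, by=1))

# The non-equi join emits one row per (bound, matching sample) pair,
# plus one NA row for each bound that has no matching sample.
joined <- dt[bounds, on=c("x<=X", "y<=Y", "z<=Z"), allow.cartesian=TRUE]

nrow(bounds)   # number of grid points
nrow(joined)   # number of intermediate rows that must be aggregated
```

Because every sample is repeated for each bound that dominates it, the intermediate table grows with the number of grid points times the number of samples below them, which is why refining the grid hurts so much.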
Figured it out.
# Remember the original sample count: the rolling joins below drop samples that exceed every bound.
n.samples <- nrow(dt)
# Step 1 - map each sample to the nearest X, Y, and Z at or above it. (In other words, bin the data.)
X <- data.table(X=seq(-2, 2, by=.1)); X[, x := X]
Y <- data.table(Y=seq(-2, 2, by=.1)); Y[, y := Y]
Z <- data.table(Z=seq(-2, 2, by=.1)); Z[, z := Z]
dt <- X[dt, on="x", roll=-Inf, nomatch=0]
dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
dt <- Z[dt, on="z", roll=-Inf, nomatch=0]
# Step 2 - aggregate by unique (X, Y, Z) triplets and count the samples directly below each of these bounds.
bg <- dt[, .N, keyby=list(X, Y, Z)]
# Step 3 - get the count of samples directly below EVERY (X, Y, Z) bound.
bounds <- CJ(X=X$X, Y=Y$Y, Z=Z$Z)
kl <- bg[bounds, on=c("X", "Y", "Z")]
kl[is.na(N), N := 0L]
# Step 4 (the tricky part) - consider a single (Y, Z) pair. X will be in ascending order,
# so a cumsum over X for each (Y, Z) counts x <= X | Y, Z.
# Now hold X and Z fixed: a cumsum over Y (also in ascending order) counts x <= X, y <= Y | Z.
# Continuing the same way over Z gives the count of x <= X, y <= Y, z <= Z.
kl[, CountUntil.XgivenYZ := cumsum(N), by=list(Y, Z)]
kl[, CountUntil.XYgivenZ := cumsum(CountUntil.XgivenYZ), by=list(X, Z)]
kl[, CountUntil.XYZ := cumsum(CountUntil.XYgivenZ), by=list(X, Y)]
# Cleanup
setnames(kl, "CountUntil.XYZ", "N.cum")
# Divide by the original sample count, not nrow(dt), which now excludes the dropped samples.
kl[, CDF := N.cum/n.samples]
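The successive-cumsum step is easier to follow on a tiny two-variable example. The sketch below uses a plain base R matrix instead of a data.table, and the names `cell_counts`, `cum_x` and `cum_xy` are made up for illustration. Rows are X bins and columns are Y bins, both in ascending order.

```r
# Number of samples whose nearest upper bounds are (X[i], Y[j]).
cell_counts <- matrix(c(1, 0, 2,
                        0, 3, 1), nrow = 2, byrow = TRUE)

# Cumulate along X (down each column): counts of x <= X within each Y bin.
cum_x <- apply(cell_counts, 2, cumsum)

# Then cumulate along Y (across each row): counts of x <= X and y <= Y.
cum_xy <- t(apply(cum_x, 1, cumsum))

cum_xy
#      [,1] [,2] [,3]
# [1,]    1    1    3
# [2,]    1    4    7

# Brute-force check: each entry equals the sum of the block below and left of it.
brute <- outer(1:2, 1:3, Vectorize(function(i, j) sum(cell_counts[1:i, 1:j])))
stopifnot(all(cum_xy == brute))
```

The bottom-right entry (7) is the total number of samples, as it should be for the largest bound in every coordinate.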
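For anyone who wants it, I generalized this to work with any number of variables and dumped the function into my R package, mltools.
For example, to solve this problem you can do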
library(mltools)
bounds <- list(x=seq(-2, 2, by=.1), y=seq(-2, 2, by=.1), z=seq(-2, 2, by=.1))
empirical_cdf(x=dt, ubounds=bounds)
x y z N.cum CDF
1: -2 -2 -2.0 0 0.000
2: -2 -2 -1.9 0 0.000
3: -2 -2 -1.8 0 0.000
4: -2 -2 -1.7 0 0.000
5: -2 -2 -1.6 0 0.000
---
68917: 2 2 1.6 899 0.899
68918: 2 2 1.7 909 0.909
68919: 2 2 1.8 917 0.917
68920: 2 2 1.9 924 0.924
68921: 2 2 2.0 929 0.929
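For readers who want to see how the bin, count and cumsum steps might generalize without installing the package, here is a rough sketch. It is not the mltools implementation: the function name `ecdf_grid` is hypothetical, it bins with base R's `findInterval` instead of rolling joins, and it assumes at least two variables and a named list of upper bounds.

```r
library(data.table)

# Hypothetical helper, NOT the mltools implementation.
# samples: data.table of numeric columns.
# ubounds: named list of upper-bound vectors, one per column to use.
ecdf_grid <- function(samples, ubounds) {
  vars <- names(ubounds)
  stopifnot(length(vars) >= 2, all(vars %in% names(samples)))
  ubounds <- lapply(ubounds, function(b) sort(unique(b)))

  # Bin: replace each value by the smallest bound >= it (NA above every bound).
  binned <- as.data.table(lapply(setNames(vars, vars), function(v) {
    b <- ubounds[[v]]
    b[findInterval(samples[[v]], b, left.open = TRUE) + 1L]
  }))
  binned <- na.omit(binned)

  # Count samples per occupied cell, then expand to the full grid.
  counts <- binned[, .N, by = vars]
  grid <- do.call(CJ, ubounds)
  grid <- counts[grid, on = vars]
  grid[is.na(N), N := 0L]

  # Cumulate along one variable at a time, holding the others fixed.
  for (v in vars) {
    others <- setdiff(vars, v)
    grid[, N := cumsum(N), by = others]
  }

  setnames(grid, "N", "N.cum")
  grid[, CDF := N.cum / nrow(samples)]
  grid[]
}

# Usage on a small two-variable example.
set.seed(1)
samples <- data.table(x = rnorm(200), y = rnorm(200))
res <- ecdf_grid(samples, list(x = seq(-1, 1, by = 0.5), y = seq(-1, 1, by = 0.5)))
```

The loop works because `CJ` returns the grid sorted by every column, so within each group the variable being cumulated is already in ascending order.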