Fast computation of a CDF over multiple columns / rolling join

Ben*_*Ben 10 r data.table

I'm trying to measure the empirical cumulative distribution of some data in a multivariate setting. That is, given a dataset

library(data.table)  # v 1.9.7

set.seed(2016)
dt <- data.table(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
dt
             x        y       z
   1: -0.91474  2.07025 -1.7499
   2:  1.00125 -1.80941 -1.3856
   3: -0.05642  1.58499  0.8110
   4:  0.29665 -1.16660  0.3757
   5: -2.79147 -1.75526  1.2851
  ---                          
 996:  0.63423  0.13597 -2.3710
 997:  0.21415  1.03161 -1.5440
 998:  1.15357 -1.63713  0.4191
 999:  0.79205 -0.56119  0.6670
1000:  0.19502 -0.05297 -0.3288

I want to count the number of samples such that (x <= X, y <= Y, z <= Z) for some grid of (X, Y, Z) upper bounds, such as

bounds <- CJ(X=seq(-2, 2, by=.1), Y=seq(-2, 2, by=.1), Z=seq(-2, 2, by=.1))
bounds
        X  Y    Z
    1: -2 -2 -2.0
    2: -2 -2 -1.9
    3: -2 -2 -1.8
    4: -2 -2 -1.7
    5: -2 -2 -1.6
   ---           
68917:  2  2  1.6
68918:  2  2  1.7
68919:  2  2  1.8
68920:  2  2  1.9
68921:  2  2  2.0
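
As a reference point, the count at a single bound can be computed directly by filtering. The following is a minimal sketch that recreates `dt` from above; the helper name `count_below` is hypothetical and introduced only for illustration. It is far too slow to run over the full grid, but it is handy for spot-checking any faster approach.

```r
library(data.table)

set.seed(2016)
dt <- data.table(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))

# Hypothetical helper: count samples with x <= X, y <= Y and z <= Z
count_below <- function(d, X, Y, Z) d[x <= X & y <= Y & z <= Z, .N]

n_origin <- count_below(dt, 0, 0, 0)   # samples at or below the bound (0, 0, 0)
cdf_origin <- n_origin / nrow(dt)      # empirical CDF evaluated at (0, 0, 0)
```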

Now, I've figured out that I can do this elegantly (using non-equi joins)

dt[, Count := 1]
result <- dt[bounds, on=c("x<=X", "y<=Y", "z<=Z"), allow.cartesian=TRUE][, list(N.cum = sum(!is.na(Count))), keyby=list(X=x, Y=y, Z=z)]
result[, CDF := N.cum/nrow(dt)]
result
        X  Y    Z N.cum   CDF
    1: -2 -2 -2.0     0 0.000
    2: -2 -2 -1.9     0 0.000
    3: -2 -2 -1.8     0 0.000
    4: -2 -2 -1.7     0 0.000
    5: -2 -2 -1.6     0 0.000
   ---                       
68917:  2  2  1.6   899 0.899
68918:  2  2  1.7   909 0.909
68919:  2  2  1.8   917 0.917
68920:  2  2  1.9   924 0.924
68921:  2  2  2.0   929 0.929

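But this approach is very inefficient and becomes very slow as I increase the bin count. I think a multivariate version of data.table's rolling join functionality would do the trick, but as far as I know that isn't possible. Any suggestions to speed this up?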

Ben*_*Ben 5

Figured it out.

# Step 1 - map each sample to the nearest X, Y, and Z bound at or above it. (In other words, bin the data.)
# Record the sample size first: nomatch=0 drops samples that lie above the largest bound.
n.samples <- nrow(dt)

X <- data.table(X=seq(-2, 2, by=.1)); X[, x := X]
Y <- data.table(Y=seq(-2, 2, by=.1)); Y[, y := Y]
Z <- data.table(Z=seq(-2, 2, by=.1)); Z[, z := Z]

dt <- X[dt, on="x", roll=-Inf, nomatch=0]
dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
dt <- Z[dt, on="z", roll=-Inf, nomatch=0]

# Step 2 - aggregate by unique (X, Y, Z) triplets and count the samples that fall in each bin.
bg <- dt[, .N, keyby=list(X, Y, Z)]

# Step 3 - get the bin count for EVERY (X, Y, Z) bound, filling empty bins with 0.
bounds <- CJ(X=X$X, Y=Y$Y, Z=Z$Z)
kl <- bg[bounds, on=c("X", "Y", "Z")]
kl[is.na(N), N := 0L]

# Step 4 (the tricky part) - consider a single (Y, Z) pair. X will be in ascending order,
# so a cumsum over X for each (Y, Z) counts x <= X | Y, Z.
# Now hold X and Z fixed and do a cumsum over Y (also in ascending order) to count x <= X, y <= Y | Z.
# Then repeat the process for Z.
kl[, CountUntil.XgivenYZ := cumsum(N), by=list(Y, Z)]
kl[, CountUntil.XYgivenZ := cumsum(CountUntil.XgivenYZ), by=list(X, Z)]
kl[, CountUntil.XYZ := cumsum(CountUntil.XYgivenZ), by=list(X, Y)]

# Cleanup - divide by the original sample size, not nrow(dt), which shrank in Step 1.
setnames(kl, "CountUntil.XYZ", "N.cum")
kl[, CDF := N.cum/n.samples]
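
To make the binning and successive-cumsum idea easier to verify, here is a small two-variable sketch of the same steps, checked against a brute-force count. The data are synthetic and the names `pts`, `gx`, `gy`, `grid`, `n_total` and `N.check` are illustrative assumptions, not part of the code above. Some samples deliberately fall above the largest bound, which is why the full sample size is stored before the rolling joins drop them.

```r
library(data.table)

set.seed(1)
pts <- data.table(x=runif(200, -1.5, 1.5), y=runif(200, -1.5, 1.5))
n_total <- nrow(pts)   # full sample size, taken before any rows are dropped

# Upper bounds per variable: a join column plus a copy that survives the join
gx <- data.table(X=seq(-1, 1, by=0.5)); gx[, x := X]
gy <- data.table(Y=seq(-1, 1, by=0.5)); gy[, y := Y]

# Bin: roll=-Inf assigns each sample to the nearest bound at or above it;
# samples above the largest bound have no match and are dropped by nomatch=0
binned <- gx[pts, on="x", roll=-Inf, nomatch=0]
binned <- gy[binned, on="y", roll=-Inf, nomatch=0]

# Per-bin counts, expanded to the whole grid with zeros for empty bins
cells <- binned[, .N, keyby=.(X, Y)]
grid <- cells[CJ(X=gx$X, Y=gy$Y), on=c("X", "Y")]
grid[is.na(N), N := 0L]

# Cumulate along X within each Y, then along Y within each X
grid[, N.cum := cumsum(N), by=Y]
grid[, N.cum := cumsum(N.cum), by=X]
grid[, CDF := N.cum / n_total]

# Brute-force count at every bound, for comparison
grid[, N.check := sum(pts$x <= X & pts$y <= Y), by=.(X, Y)]
stopifnot(grid[, all(N.cum == N.check)])
```

The order of the two cumulative sums does not matter; what matters is that rows within each group are sorted ascending in the variable being cumulated, which the `CJ()` ordering provides here.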

Generalization

For anyone who wants it, I generalized this to work with any number of variables and put the function in my R package, mltools.

For example, to solve this problem you can do

library(mltools)

bounds <- list(x=seq(-2, 2, by=.1), y=seq(-2, 2, by=.1), z=seq(-2, 2, by=.1))
empirical_cdf(x=dt, ubounds=bounds)
        x  y    z N.cum   CDF
    1: -2 -2 -2.0     0 0.000
    2: -2 -2 -1.9     0 0.000
    3: -2 -2 -1.8     0 0.000
    4: -2 -2 -1.7     0 0.000
    5: -2 -2 -1.6     0 0.000
   ---                       
68917:  2  2  1.6   899 0.899
68918:  2  2  1.7   909 0.909
68919:  2  2  1.8   917 0.917
68920:  2  2  1.9   924 0.924
68921:  2  2  2.0   929 0.929