Row-wise operations on a large dataset

Question

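I am looking for a faster way to perform the operation below. The dataset contains more than 1M rows, but I have provided a simplified example to illustrate the task.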

To create the data table:

library(data.table)

dt <- data.table(name=c("john","jill"), a1=c(1,4), a2=c(2,5), a3=c(3,6), 
      b1=c(10,40), b2=c(20,50), b3=c(30,60))

colGroups <- c("a","b")   # Columns starting in "a", and in "b"

Original Dataset
-----------------------------------
name    a1   a2   a3   b1   b2   b3
john    1    2    3    10   20   30
jill    4    5    6    40   50   60


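The dataset above is transformed so that 2 new rows are added for each unique name, and in each new row the values within each column group are independently shifted left by one more position, with zeros filling the vacated columns (this example uses only the "a" and "b" column groups, but the real data has many more).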

Transformed Dataset
-----------------------------------
name    a1   a2   a3   b1   b2   b3
john    1    2    3    10   20   30  # First Row for John
john    2    3    0    20   30    0  # "a" values left shifted, "b" values left shifted
john    3    0    0    30   0     0  # Same as above, left-shifted again

jill    4    5    6    40   50   60  # Repeated for Jill
jill    5    6    0    50   60    0 
jill    6    0    0    60    0    0


And so on. My dataset is very large, which is why I am trying to find out whether there is an efficient way to do this.

Thanks in advance.

Answer 1

Aru*_*run 5

Update: A (much) faster solution is to work with the column indices directly, as follows (takes about 4 seconds on 1e6 rows by 7 columns):

ll <- vector("list", 3)
ll[[1]] <- copy(dt[, -1])
d_idx <- seq(2, ncol(dt), by=3)
for (j in 1:2) {
    tmp <- vector("list", 2)
    for (i in seq_along(colGroups)) {
        idx <- ((i-1)*3+2):((i*3)+1)
        cols <- setdiff(idx, d_idx[i]:(d_idx[i]+j-1))
        # ..cols means "look up one level"
        tmp[[i]] <- cbind(dt[, ..cols], data.table(matrix(0, ncol=j)))
    }
    ll[[j+1]] <- do.call(cbind, tmp)
}
ans <- cbind(data.table(name=dt$name), rbindlist(ll))
setkey(ans, name)

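The index arithmetic above hard-codes a group width of 3 and an identifier in column 1. As a rough sketch only, the same idea can be written so that the columns of each group are looked up by prefix instead. The helper name `shift_groups_left` and its `id` argument are hypothetical (they are not part of the answer), and the sketch assumes that no identifier column starts with one of the group prefixes.

```r
library(data.table)

# Hypothetical helper, not from the original answer: builds one shifted copy
# of the table per shift amount and stacks them.
shift_groups_left <- function(dt, colGroups, id = "name") {
  # Column names belonging to each prefix, in their existing order
  groups <- lapply(colGroups, function(p) {
    setdiff(grep(paste0("^", p), names(dt), value = TRUE), id)
  })
  width <- max(lengths(groups))
  out <- vector("list", width)
  for (s in 0:(width - 1L)) {
    shifted <- copy(dt)
    for (g in groups) {
      k <- length(g)
      for (pos in seq_len(k)) {
        src <- pos + s
        # Take the value from `s` columns to the right, or 0 past the group end
        new_col <- if (src <= k) dt[[g[src]]] else rep(0, nrow(dt))
        set(shifted, j = g[pos], value = new_col)
      }
    }
    out[[s + 1L]] <- shifted
  }
  ans <- rbindlist(out)
  setkeyv(ans, id)  # groups the rows by name; the sort is stable, so shift order is kept
  ans[]
}

dt <- data.table(name = c("john", "jill"), a1 = c(1, 4), a2 = c(2, 5), a3 = c(3, 6),
                 b1 = c(10, 40), b2 = c(20, 50), b3 = c(30, 60))
ans <- shift_groups_left(dt, c("a", "b"))
```

Like the code above, this holds one full copy of the table per shift in memory, so whether it is practical at more than 1M rows depends on the number of columns per group.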

First attempt (old): Very interesting question. I would approach it using melt.data.table and dcast.data.table (available from 1.8.11) as follows:

require(data.table)
require(reshape2)
# melt is S3 generic, calls melt.data.table, returns a data.table (very fast)
ans <- melt(dt, id=1, measure=2:7, variable.factor=FALSE)[, 
                    grp := rep(colGroups, each=nrow(dt)*3)]
setkey(ans, name, grp)
ans <- ans[, list(variable=c(variable, variable[1:(.N-1)], 
          variable[1:(.N-2)]), value=c(value, value[-1],
     value[-(1:2)]), id2=rep.int(1:3, 3:1)), list(name, grp)]
# dcast in reshape2 is not yet a S3 generic, have to call by full name
ans <- dcast.data.table(ans, name+id2~variable, fill=0L)[, id2 := NULL]

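Recent data.table releases export `melt` and `dcast` themselves, so the reshape step can be run without reshape2. The following is a minimal sketch of the same melt/shift/dcast idea under that assumption. It also assumes every measure column is named as a letter prefix followed by a numeric position (`a1`, `b3`, ...); the helper columns `grp` and `pos` are introduced here for illustration only.

```r
library(data.table)

dt <- data.table(name = c("john", "jill"), a1 = c(1, 4), a2 = c(2, 5), a3 = c(3, 6),
                 b1 = c(10, 40), b2 = c(20, 50), b3 = c(30, 60))

# Long format: one row per (name, column)
long <- melt(dt, id.vars = "name", variable.factor = FALSE)

# Assumption: column names are "<prefix><position>", e.g. "a1", "b3"
long[, grp := sub("[0-9]+$", "", variable)]
long[, pos := as.integer(sub("^[^0-9]+", "", variable))]

# For a shift of s, the value at position p moves to position p - s;
# values that would fall off the left edge are dropped.
width <- long[, max(pos)]
res <- rbindlist(lapply(0:(width - 1L), function(s) {
  long[pos > s, .(name, id2 = s + 1L, variable = paste0(grp, pos - s), value)]
}))

# Back to wide format; positions with no value left are filled with 0
wide <- dcast(res, name + id2 ~ variable, value.var = "value", fill = 0)
wide[, id2 := NULL]
```

The output columns come back in the sorted order of `variable`, so groups with ten or more positions would need an explicit `setcolorder` afterwards.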

Benchmarking on 1e6 rows with the same number of columns:

require(data.table)
require(reshape2)
set.seed(45)
N <- 1e6
dt <- cbind(data.table(name=paste("x", 1:N, sep="")), 
               matrix(sample(10, 6*N, TRUE), nrow=N))
setnames(dt, c("name", "a1", "a2", "a3", "b1", "b2", "b3"))
colGroups = c("a", "b")

system.time({
ans <- melt(dt, id=1, measure=2:7, variable.factor=FALSE)[, 
                    grp := rep(colGroups, each=nrow(dt)*3)]
setkey(ans, name, grp)
ans <- ans[, list(variable=c(variable, variable[1:(.N-1)], 
          variable[1:(.N-2)]), value=c(value, value[-1],
     value[-(1:2)]), id2=rep.int(1:3, 3:1)), list(name, grp)]
ans <- dcast.data.table(ans, name+id2~variable, fill=0L)[, id2 := NULL]

})

#   user  system elapsed 
# 45.627   2.197  52.051

