Efficient R code for finding the indices associated with unique values in a vector

use*_*361 8 r list vector unique data.table

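Suppose I have the vector vec <- c("D","B","B","C","C").

My goal is to end up with a list of length length(unique(vec)), where element i of the list is a vector of indices giving the positions of unique(vec)[i] in vec.

For example, for this vec the list would be: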

exampleList <- list()
exampleList[[1]] <- c(1) #Since "D" is the first element
exampleList[[2]] <- c(2,3) #Since "B" is the 2nd/3rd element.
exampleList[[3]] <- c(4,5) #Since "C" is the 4th/5th element.

I tried the following, but it is too slow. My real data is large, so I need faster code:

vec <- c("D","B","B","C","C")
uniques <- unique(vec)
exampleList <- lapply(seq_along(uniques), function(i) {
    which(vec==uniques[i])
})
exampleList

edd*_*ddi 6

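Update: DT[, list(list(.)), by=.] sometimes gave wrong results in R versions >= 3.1.0. This is now fixed in commit #1280 of the current development version of data.table, v1.9.3. From NEWS:

  • DT[, list(list(.)), by=.] returns correct results in R >= 3.1.0 as well. The bug was due to recent (welcoming) changes in R v3.1.0 where list(.) does not result in a copy. Closes #481.

Using data.table is about 15 times faster than tapply: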

library(data.table)

vec <- c("D","B","B","C","C")

dt = as.data.table(vec)[, list(list(.I)), by = vec]
dt
#   vec  V1
#1:   D   1
#2:   B 2,3
#3:   C 4,5

# to get it in the desired format
# (perhaps in the future data.table's setnames will work for lists instead)
setattr(dt$V1, 'names', dt$vec)
dt$V1
#$D
#[1] 1
#
#$B
#[1] 2 3
#
#$C
#[1] 4 5

Speed test:

vec = sample(letters, 1e7, TRUE)

system.time(tapply(seq_along(vec), vec, identity)[unique(vec)])
#   user  system elapsed 
#   7.92    0.35    8.50 

system.time({dt = as.data.table(vec)[, list(list(.I)), by = vec]; setattr(dt$V1, 'names', dt$vec); dt$V1})
#   user  system elapsed 
#   0.39    0.09    0.49 


jos*_*ber 5

You can do this with tapply:

vec <- c("D", "B", "B", "C", "C")
tapply(seq_along(vec), vec, identity)[unique(vec)]
# $D
# [1] 1
# 
# $B
# [1] 2 3
# 
# $C
# [1] 4 5

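The identity function returns its argument unchanged, and indexing by unique(vec) ensures the groups come back in the order in which their values first appear in the original vector.

  • A similar approach would be: `split(seq_along(vec), vec)` (5 upvotes)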
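To make the effect of the `[unique(vec)]` step concrete, here is a minimal sketch in base R. The names `by_level` and `by_appearance` are illustrative only and do not come from the answer.

```r
vec <- c("D", "B", "B", "C", "C")

# Without reindexing, tapply groups by the sorted factor levels: B, C, D
by_level <- tapply(seq_along(vec), vec, identity)
names(by_level)

# Reindexing by unique(vec) restores first-appearance order: D, B, C
by_appearance <- by_level[unique(vec)]
names(by_appearance)
```

One caveat: if every value occurs exactly once, tapply simplifies the result to a plain vector unless simplify = FALSE is passed.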

leb*_*nok 5

split(seq_along(vec), vec)

This is faster and shorter than the tapply solution:

vec = sample(letters, 1e7, TRUE)
system.time(res1 <- tapply(seq_along(vec), vec, identity)[unique(vec)])
#   user  system elapsed 
#  1.808   0.364   2.176 
system.time(res2 <- split(seq_along(vec), vec))
#   user  system elapsed 
#  0.876   0.152   1.029 
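Note that split() returns its groups in sorted factor-level order rather than first-appearance order, so it does not match the layout asked for in the question until it is reindexed. The sketch below, using illustrative names (`by_level`, `by_appearance`, `reference`), reorders the result and checks it against the question's lapply/which approach on the small example.

```r
vec <- c("D", "B", "B", "C", "C")

# split() groups by the sorted factor levels of vec: B, C, D
by_level <- split(seq_along(vec), vec)

# Reorder to match unique(vec): D, B, C
by_appearance <- by_level[unique(vec)]

# Reference result from the question's lapply/which approach
uniques <- unique(vec)
reference <- lapply(seq_along(uniques), function(i) which(vec == uniques[i]))

# Same indices in the same order, ignoring names
identical(unname(by_appearance), reference)
```

The extra reindexing touches only one list element per unique value, so it should add little to the timing above when the number of distinct values is small relative to the vector length.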