Vectorizing a for loop with the unique function in R

Can*_*ice 7 loops r list unique vectorization

player_ids = c(34, 87, 27, 34, 87, 9, 29, 25, 24, 25, 34, 37)
end = length(player_ids)
unique_players_list = list()

for(i in 1:end) {
  unique_players_list[[i]] = unique(player_ids_unlisted[1:i])
}

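This is the for loop (shortened version) that I am trying to vectorize. I don't know how to post code output, but the unique_players_list list should have the following output: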

unique_players_list[[1]] == c(34)
unique_players_list[[2]] == c(34, 87)
unique_players_list[[3]] == c(34, 87, 27)
unique_players_list[[4]] == c(34, 87, 27)
unique_players_list[[5]] == c(34, 87, 27)

And so on. The output does not have to be a list; I would actually prefer a data frame. But I need this vectorized, because my current for loop takes forever and I need to run this code tens of thousands of times.

Thanks!

Mar*_*gan 8

A fairly literal implementation of the problem is to lapply along the player ids and return the unique elements of the head of the ids.

f0 <- function(player_ids)
    lapply(seq_along(player_ids), function(i) unique(head(player_ids, i)))

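This avoids the need to manage allocation of the result list, and it also handles the case where length(player_ids) == 0L. For a more efficient implementation, create a list of the "cumulative" sets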

uid <- unique(player_ids)
sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])

and then identify the set that belongs to the i-th index

did <- !duplicated(player_ids)
sets[cumsum(did)]
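
To make the indexing step concrete, here is a small sketch on the question's data. It only exposes the intermediate values; `idx` and `result` are names introduced for illustration, and the values in the comments were worked out by hand for this particular input.

```r
player_ids <- c(34, 87, 27, 34, 87, 9, 29, 25, 24, 25, 34, 37)

uid  <- unique(player_ids)                                   # 34 87 27 9 29 25 24 37
sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])  # sets[[k]] holds the first k unique ids

did <- !duplicated(player_ids)   # TRUE where an id appears for the first time
idx <- cumsum(did)               # 1 2 3 3 3 4 5 6 7 7 7 8: unique ids seen up to each position

result <- sets[idx]              # one cumulative set per input position
result[[5]]                      # 34 87 27
```

Because `idx` never decreases, position i simply picks the cumulative set whose size equals the number of distinct ids seen so far.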

Here are some of the solutions so far

f1 <- function(player_ids) {
    end = length(player_ids)
    tank <- player_ids[1]

    unique_players_list = vector("list", end)
    for(i in 1:end) {
        if (!player_ids[i] %in% tank) tank <- c(tank, player_ids[i])
        unique_players_list[[i]] = tank
    }
    unique_players_list
}

f2 <- function(player_ids) {
    un = unique(player_ids)
    ma = match(un, player_ids)
    li = vector("list", length(player_ids))

    for (i in seq_along(player_ids))
        li[[i]] = un[ma <= i]
    li
}

f3 <- function(player_ids) {
    uid <- unique(player_ids)
    sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
    sets[cumsum(!duplicated(player_ids))]
}

Some basic tests that they produce reasonable results

> identical(f1(player_ids), f2(player_ids))
[1] TRUE
> identical(f1(player_ids), f3(player_ids))
[1] TRUE

and an assessment of performance on a larger data set

> library(microbenchmark)
> ids <- sample(100, 10000, TRUE)
> microbenchmark(f1(ids), f2(ids), f3(ids), times=10)
Unit: microseconds
    expr       min        lq       mean     median        uq       max neval
 f1(ids) 24397.193 25820.375 32055.5720 26475.8245 28030.866 56487.781    10
 f2(ids) 20607.564 22148.888 34462.5850 24432.4785 51722.208 53473.468    10
 f3(ids)   414.649   458.271   772.3738   501.5185   686.383  2163.261    10

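f3() performs well when the vector of initial values is large compared to the number of unique values. Here is a data set where the elements of the original vector are mostly unique, and the timings are more comparable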

> ids <- sample(1000000, 10000, TRUE)
> microbenchmark(f1(ids), f2(ids), f3(ids), times=10)
Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval
 f1(ids) 214.2505 232.3902 233.7632 233.4617 237.5509 249.4652    10
 f2(ids) 433.5181 443.5987 512.4475 463.8388 467.3710 949.4882    10
 f3(ids) 299.2291 301.4931 307.7576 302.9375 316.6055 321.3942    10

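Getting the edge cases right is important. A common problem is the zero-length vector, e.g., f2(integer()); f1() does not handle this case. Interestingly, I think all of the implementations are agnostic to the type of the input, e.g., f1(sample(letters, 100, TRUE)) works.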
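
As a minimal sketch of how the loop-based approach could be made to cope with empty input, the variant below starts from a zero-length `tank` and iterates with seq_along() instead of 1:end. The name `f1_safe` is hypothetical and not part of any answer here.

```r
# Hypothetical variant of f1() that also accepts zero-length input
f1_safe <- function(player_ids) {
    tank <- player_ids[0]                  # zero-length vector of the input's type
    out <- vector("list", length(player_ids))
    for (i in seq_along(player_ids)) {     # iterates zero times on empty input
        if (!player_ids[i] %in% tank) tank <- c(tank, player_ids[i])
        out[[i]] <- tank
    }
    out
}

f1_safe(integer())            # list()
f1_safe(c("a", "b", "a"))     # works for character input as well
```

The only changes relative to f1() are the initial value of `tank` and the loop bounds; 1:end counts down from 1 to 0 when end is 0, which is what trips up the original.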

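Some off-line discussion led to the suggestion that the return format is neither convenient nor memory efficient, and that duplicated() and unique() are in some sense similar operations, so we should be able to get away with a single call. This leads to the following solution, which returns a list containing the unique identifiers and, for each player_id, the offset to the end of the unique identifiers seen so far.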

f5 <- function(player_ids) {
    did <- !duplicated(player_ids)
    list(uid = player_ids[did], end_idx = cumsum(did))
}

The results cannot be compared directly with the others using identical() or similar. An updated f3() is

f3a <- function(player_ids) {
    did <- !duplicated(player_ids)
    uid <- player_ids[did]
    sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
    sets[cumsum(did)]
}
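
The compact result of f5() can still be expanded on demand. The sketch below uses a hypothetical helper, `set_at()` (an illustration, not part of the answer), to rebuild the set for position i from `uid` and `end_idx`, and checks the expansion against f3a() on the question's data.

```r
f5 <- function(player_ids) {
    did <- !duplicated(player_ids)
    list(uid = player_ids[did], end_idx = cumsum(did))
}

f3a <- function(player_ids) {
    did <- !duplicated(player_ids)
    uid <- player_ids[did]
    sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
    sets[cumsum(did)]
}

# Hypothetical helper: the unique ids seen up to position i
set_at <- function(res, i) res$uid[seq_len(res$end_idx[i])]

player_ids <- c(34, 87, 27, 34, 87, 9, 29, 25, 24, 25, 34, 37)
res <- f5(player_ids)

set_at(res, 4)   # 34 87 27

# Expanding every position reproduces the list returned by f3a()
expanded <- lapply(seq_along(player_ids), function(i) set_at(res, i))
identical(expanded, f3a(player_ids))   # TRUE
```

Keeping only `uid` and `end_idx` defers the copying until a particular set is actually needed.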

Here are some performance metrics

> ids <- sample(100, 10000, TRUE)
> print(object.size(f3(ids)), units="auto")
4.2 Mb
> print(object.size(f5(ids)), units="auto")
39.8 Kb
> microbenchmark(f3(ids), f3a(ids), f5(ids), times=10)
Unit: microseconds
    expr     min      lq     mean   median      uq     max neval
 f3(ids) 437.663 445.091 450.3965 447.3755 452.629 476.016    10
f3a(ids) 342.378 351.408 385.0844 354.2375 369.861 638.084    10
 f5(ids) 125.956 127.684 129.9898 128.5890 130.202 140.521    10

> ids <- sample(1000000, 10000, TRUE)
> microbenchmark(f3(ids), f3a(ids), f5(ids), times=10)
Unit: microseconds
     expr        min         lq         mean     median          uq         max neval
  f3(ids) 816317.361 821892.902  911862.5561 831274.596 1107496.984 1112586.295    10
 f3a(ids) 824593.618 827590.130 1009032.9519 829197.863  838559.619 2607916.641    10
  f5(ids)    213.677    270.397     313.1614    282.213     315.683     601.724    10


李哲源*_*李哲源 5

I suspect there is a typo in the code you posted. I think you meant

unique_players_list[[i]] = unique(player_ids[1:i])

Well, the loop is slow because you call unique() in every iteration. The i-th iteration costs O(i), so looping over 1:n adds up to O(n^2), which ends up being too expensive.
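
As a rough illustration of that growth (a back-of-the-envelope count, not a benchmark): the i-th iteration hands i elements to unique(), so the whole loop processes 1 + 2 + ... + n = n(n + 1)/2 elements. The variable names below are illustrative only.

```r
n <- 10000

# Elements passed to unique() over the whole loop: 1 + 2 + ... + n
work_loop <- sum(as.numeric(seq_len(n)))

work_loop == n * (n + 1) / 2   # TRUE
work_loop / n                  # 5000.5, i.e. about n/2 times the work of a single pass
```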

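What we want is linear cost: O(n). The following code does this. Basically, we initialize tank to hold the unique values identified so far, and update it whenever a new value shows up.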

player_ids <- c(34, 87, 27, 34, 87, 9, 29, 25, 24, 25, 34, 37)
end <- length(player_ids)
tank <- player_ids[1]

unique_players_list <- vector(mode = "list", end)
for(i in 1:end) {
  if (!player_ids[i] %in% tank) tank <- c(tank, player_ids[i])
  unique_players_list[[i]] <- tank
}

For this type of operation, where the (i+1)-th result depends on the i-th result, vectorization is not possible.

  • Yes, try it with `player_ids = sample(100, 10000, TRUE)`; for me, pre-allocating is 10 times faster. (3 upvotes)
  • Pre-allocate `unique_players_list` to avoid repeated copying, `unique_players_list = vector("list", end)` (2 upvotes)

Gre*_*gor 5

I would do it like this. We can use plenty of vectorized functions before the loop, and only simple indexing inside the loop.

un = unique(player_ids)
ma = match(un, player_ids)
li = vector("list", length(player_ids))

for (i in seq_along(player_ids)) {
    li[[i]] = un[ma <= i]
}

head(li)
# [[1]]
# [1] 34
# 
# [[2]]
# [1] 34 87
# 
# [[3]]
# [1] 34 87 27
# 
# [[4]]
# [1] 34 87 27
# 
# [[5]]
# [1] 34 87 27
# 
# [[6]]
# [1] 34 87 27  9
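
To see what the indexing relies on, the sketch below prints the intermediate vectors for the question's data; the values in the comments were worked out by hand for this input. The lapply() form at the end is an assumed equivalent of the loop above, not something taken from the answer.

```r
player_ids <- c(34, 87, 27, 34, 87, 9, 29, 25, 24, 25, 34, 37)

un <- unique(player_ids)      # 34 87 27 9 29 25 24 37
ma <- match(un, player_ids)   # 1 2 3 6 7 8 9 12: position of each id's first appearance

# Ids whose first appearance is at or before position 6
un[ma <= 6]                   # 34 87 27 9

# The same loop written with lapply(), which also returns list() for zero-length input
li <- lapply(seq_along(player_ids), function(i) un[ma <= i])
```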