有效地查找查找表中向量的所有匹配项,并重复

dww*_*dww 11 r

我想x在另一个查找向量中查找向量的所有匹配项的索引table

table = rep(1:5, each=3)
x = c(2, 5, 2, 6)
Run Code Online (Sandbox Code Playgroud)

标准的基本 R 方法并不能完全满足我的需求。例如使用which(table %in% x)我们只获得一次匹配索引,即使2在中出现两次x

which(table %in% x)
# [1]  4  5  6 13 14 15
Run Code Online (Sandbox Code Playgroud)

另一方面,match返回每次出现匹配的 x 的值,但仅返回查找表中的第一个索引。

match(x, table)
# [1]  4 13  4 NA
Run Code Online (Sandbox Code Playgroud)

我想要的是一个返回“所有 x 和所有 y”索引的函数。即它应该返回以下期望的结果:

mymatch(x, table)
# c(4, 5, 6, 13, 14, 15, 4, 5, 6)
Run Code Online (Sandbox Code Playgroud)

当然,我们可以使用 R 中的循环来完成此操作:

mymatch = function(x, table) {
  matches = sapply(x, \(xx) which(table %in% xx)) 
  unlist(matches)
}

mymatch(x, table)
# [1]  4  5  6 13 14 15  4  5  6
Run Code Online (Sandbox Code Playgroud)

但这对于较大的数据来说非常慢(我需要在大数据上多次执行此操作)

table = rep(1:1e5, each=10)
x = sample(1:100, 1000, replace = TRUE)
system.time(mymatch(x, table))
#  user  system elapsed 
# 3.279   2.881   6.157 
Run Code Online (Sandbox Code Playgroud)

如果我们将其与以下示例进行比较,这是非常慢的which %in%

system.time(which(table %in% x))
#  user  system elapsed 
# 0.003   0.004   0.008 
Run Code Online (Sandbox Code Playgroud)

希望 R 中有一种快速的方法来做到这一点?否则,也许 RCpp 是可行的方法。

Ony*_*mbu 14

另一种方法是使用split

\n
unlist(split(seq(table), table)[as.character(x)],use.names = FALSE)\n[1]  4  5  6 13 14 15  4  5  6\n
Run Code Online (Sandbox Code Playgroud)\n
\n

编辑:

\n

请注意,如果table已排序,那么您可以使用rle + sequence

\n
faster <- function(x, table){\n  a <- rle(table)\n  n <- length(a$lengths)\n  idx <- match(x, a$values, 0)\n  sequence(a$lengths[idx], cumsum(c(1,a$lengths[-n]))[idx])\n}\n\nset.seed(42)\ntable = rep(1:1e5, each=10)\nx = sample(1:100, 1000, replace = TRUE)\nbench::mark(\n   faster(x, table),\n   #mymatch(x, table) |> as.vector(),\n   join_match(x, table),\n   #unlist(split(seq(table), table)[as.character(x)],use.names = FALSE),\n   check = TRUE\n )\n\n\n# A tibble: 2 \xc3\x97 13\n  expression     min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    \n  <bch:expr> <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    \n1 faster(x,\xe2\x80\xa6  54.4ms  252ms      3.97    54.9MB     1.99     2     1      503ms <int>  <Rprofmem>\n2 join_matc\xe2\x80\xa6 127.7ms  254ms      3.93    88.8MB     5.90     2     3      508ms <int>  <Rprofmem>\n# \xe2\x84\xb9 2 more variables: time <list>, gc <list>\n
Run Code Online (Sandbox Code Playgroud)\n
\n

只要表已排序,该函数就可以工作。不一定是 1:n。

\n
table = c(rep(1:5, each=3), 7,7,7,7,10,10)\nx = c(10, 2, 5,7, 2, 6)\n\nmicrobenchmark::microbenchmark(\n   faster(x, table),\n   #mymatch(x, table) |> as.vector(),\n   join_match(x, table),\n   #unlist(split(seq(table), table)[as.character(x)],use.names = FALSE),\n   check = 'equal'\n )\nUnit: microseconds\n                 expr      min       lq       mean   median       uq       max neval\n     faster(x, table)   23.001   32.751   56.95703   56.400   66.201   222.901   100\n join_match(x, table) 4216.201 4925.302 6616.51401 5572.951 7842.200 21153.402   100\n
Run Code Online (Sandbox Code Playgroud)\n

  • @dww 请注意,如果“x”中有很多重复项,您宁愿只匹配唯一值和特定 x 值的提取。即不是执行 `match(c(2,2,2,2,2), table)` 只是 `match(2,table)` 然后相应地索引 (2认同)

Jon*_*ing 9

作为加入应该更快。这快了 100 倍以上。

\n
library(dplyr)\njoin_match = function(x, table) {\n  t <- data.frame(table, index = 1:length(table))\n  xt <- data.frame(x, index = 1:length(x))\n  t |>\n    left_join(xt, join_by(table == x), relationship = 'many-to-many') |>\n    arrange(index.y) %>%\n    filter(!is.na(index.y)) %>%\n    pull(index.x)\n}\n
Run Code Online (Sandbox Code Playgroud)\n

输出相同,速度快 100-200 倍,比 @Onyambu 的基本 R 建议快约 3 倍(注意:该方法已更新为类似的速度,并且 data.table 解决方案甚至更快。使用 duckdb 或 arrow ,或者折叠起来进行连接可能会更快。但我的观察仍然是,通过将其视为连接,您可以获得显着的速度提高和易读性):

\n
set.seed(42)\ntable = rep(1:1e5, each=10)\nx = sample(1:100, 1000, replace = TRUE)\nbench::mark(\n  mymatch(x, table) |> as.vector(),\n  join_match(x, table),\n  unlist(split(seq(table), table)[as.character(x)],use.names = FALSE),\n  check = TRUE\n)\n\n# A tibble: 3 \xc3\x97 13\n  expression                                                                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result\n  <bch:expr>                                                           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>\n1 as.vector(mymatch(x, table))                                            13.8s    13.8s    0.0727    14.9GB     2.83     1    39      13.8s <int> \n2 join_match(x, table)                                                   48.7ms   62.2ms   13.8       88.8MB     3.95     7     2    506.3ms <int> \n3 unlist(split(seq(table), table)[as.character(x)], use.names = FALSE)  183.6ms  184.5ms    5.31      29.8MB     0        3     0    564.9ms <int> \n
Run Code Online (Sandbox Code Playgroud)\n


jar*_*rot 9

也许 data.table 是一个选择?如果您有相对较大的表/向量,您可能会看到速度的提高,特别是如果您采用 Jon Spring 的“连接”方法:

library(tidyverse)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:lubridate':
#> 
#>     hour, isoweek, mday, minute, month, quarter, second, wday, week,
#>     yday, year
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
library(microbenchmark)

onyambu_faster <- function(x, table){
  a <- rle(table)
  n <- length(a$lengths)
  idx <- match(x, a$values, 0)
  sequence(a$lengths[idx], cumsum(c(1,a$lengths[-n]))[idx])
}

jon_spring_join_match = function(x, table) {
  t <- data.frame(table, index = 1:length(table))
  xt <- data.frame(x, index = 1:length(x))
  t |>
    left_join(xt, join_by(table == x), relationship = 'many-to-many') |>
    arrange(index.y) %>%
    filter(!is.na(index.y)) %>%
    pull(index.x)
}

jared_mamrot_dt <- function(x, table){
  table_dt <- data.table(table, index = 1:length(table))
  x_dt <- data.table(x, index = 1:length(x))
  return(na.omit(table_dt[x_dt, on = .(table == x)][,index]))
}

table = rep(1:1e5, each=10)
x = sample(1:100, 1000, replace = TRUE)

all.equal(onyambu_faster(x, table), jared_mamrot_dt(x, table))
#> [1] TRUE
all.equal(jon_spring_join_match(x, table), jared_mamrot_dt(x, table))
#> [1] TRUE

res <- microbenchmark(onyambu_faster(x, table),
                      jon_spring_join_match(x, table),
                      jared_mamrot_dt(x, table),
                      times = 10)
res
#> Unit: milliseconds
#>                             expr       min       lq      mean   median
#>         onyambu_faster(x, table) 38.196317 45.08884  65.22651 52.40748
#>  jon_spring_join_match(x, table) 48.697968 74.54407 105.79551 83.11473
#>        jared_mamrot_dt(x, table)  9.441176 11.34315  12.99648 11.76324
#>         uq       max neval cld
#>   64.88688 129.38505    10 a  
#>  131.50681 221.16477    10  b 
#>   14.05289  21.84779    10   c
autoplot(res)
Run Code Online (Sandbox Code Playgroud)

创建于 2023 年 10 月 26 日,使用reprex v2.0.2


G. *_*eck 6

使用问题中的数据,根据我的机器上的中值时间,速度是原来的两倍。

table = rep(1:5, each=3)
x = c(2, 5, 2, 6)

mymatch = function(x, table) {
  matches = sapply(x, \(xx) which(table %in% xx)) 
  unlist(matches)
}

outer_match <- function(x, table) {
  z1 <- outer(table, x, "==") 
  z2 <- z1 * row(z1)
  z2[z2 != 0]
}

outer_match(x, table)
## [1]  4  5  6 13 14 15  4  5  6

library(microbenchmark)
microbenchmark(
 mymatch(x, table),
 outer_match(x, table)
)
## Unit: microseconds
##                   expr  min    lq    mean median   uq    max neval cld
##      mymatch(x, table) 77.0 79.15 166.696  82.75 84.3 8384.9   100   a
##  outer_match(x, table) 35.1 36.75 115.783  41.95 43.1 7410.1   100   a
Run Code Online (Sandbox Code Playgroud)