我想x在另一个查找向量中查找向量的所有匹配项的索引table。
table = rep(1:5, each=3)
x = c(2, 5, 2, 6)
Run Code Online (Sandbox Code Playgroud)
标准的基本 R 方法并不能完全满足我的需求。例如使用which(table %in% x)我们只获得一次匹配索引,即使2在中出现两次x
which(table %in% x)
# [1] 4 5 6 13 14 15
Run Code Online (Sandbox Code Playgroud)
另一方面,match返回每次出现匹配的 x 的值,但仅返回查找表中的第一个索引。
match(x, table)
# [1] 4 13 4 NA
Run Code Online (Sandbox Code Playgroud)
我想要的是一个返回“所有 x 和所有 y”索引的函数。即它应该返回以下期望的结果:
mymatch(x, table)
# c(4, 5, 6, 13, 14, 15, 4, 5, 6)
Run Code Online (Sandbox Code Playgroud)
当然,我们可以使用 R 中的循环来完成此操作:
mymatch = function(x, table) {
matches = sapply(x, \(xx) which(table %in% xx))
unlist(matches)
}
mymatch(x, table)
# [1] 4 5 6 13 14 15 4 5 6
Run Code Online (Sandbox Code Playgroud)
但这对于较大的数据来说非常慢(我需要在大数据上多次执行此操作)
table = rep(1:1e5, each=10)
x = sample(1:100, 1000, replace = TRUE)
system.time(mymatch(x, table))
# user system elapsed
# 3.279 2.881 6.157
Run Code Online (Sandbox Code Playgroud)
如果我们将其与以下示例进行比较,这是非常慢的which %in%:
system.time(which(table %in% x))
# user system elapsed
# 0.003 0.004 0.008
Run Code Online (Sandbox Code Playgroud)
希望 R 中有一种快速的方法来做到这一点?否则,也许 RCpp 是可行的方法。
Ony*_*mbu 14
另一种方法是使用split:
unlist(split(seq(table), table)[as.character(x)],use.names = FALSE)\n[1] 4 5 6 13 14 15 4 5 6\nRun Code Online (Sandbox Code Playgroud)\n编辑:
\n请注意,如果table已排序,那么您可以使用rle + sequence:
faster <- function(x, table){\n a <- rle(table)\n n <- length(a$lengths)\n idx <- match(x, a$values, 0)\n sequence(a$lengths[idx], cumsum(c(1,a$lengths[-n]))[idx])\n}\n\nset.seed(42)\ntable = rep(1:1e5, each=10)\nx = sample(1:100, 1000, replace = TRUE)\nbench::mark(\n faster(x, table),\n #mymatch(x, table) |> as.vector(),\n join_match(x, table),\n #unlist(split(seq(table), table)[as.character(x)],use.names = FALSE),\n check = TRUE\n )\n\n\n# A tibble: 2 \xc3\x97 13\n expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory \n <bch:expr> <bch:t> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> \n1 faster(x,\xe2\x80\xa6 54.4ms 252ms 3.97 54.9MB 1.99 2 1 503ms <int> <Rprofmem>\n2 join_matc\xe2\x80\xa6 127.7ms 254ms 3.93 88.8MB 5.90 2 3 508ms <int> <Rprofmem>\n# \xe2\x84\xb9 2 more variables: time <list>, gc <list>\nRun Code Online (Sandbox Code Playgroud)\n只要表已排序,该函数就可以工作。不一定是 1:n。
\ntable = c(rep(1:5, each=3), 7,7,7,7,10,10)\nx = c(10, 2, 5,7, 2, 6)\n\nmicrobenchmark::microbenchmark(\n faster(x, table),\n #mymatch(x, table) |> as.vector(),\n join_match(x, table),\n #unlist(split(seq(table), table)[as.character(x)],use.names = FALSE),\n check = 'equal'\n )\nUnit: microseconds\n expr min lq mean median uq max neval\n faster(x, table) 23.001 32.751 56.95703 56.400 66.201 222.901 100\n join_match(x, table) 4216.201 4925.302 6616.51401 5572.951 7842.200 21153.402 100\nRun Code Online (Sandbox Code Playgroud)\n
作为加入应该更快。这快了 100 倍以上。
\nlibrary(dplyr)\njoin_match = function(x, table) {\n t <- data.frame(table, index = 1:length(table))\n xt <- data.frame(x, index = 1:length(x))\n t |>\n left_join(xt, join_by(table == x), relationship = 'many-to-many') |>\n arrange(index.y) %>%\n filter(!is.na(index.y)) %>%\n pull(index.x)\n}\nRun Code Online (Sandbox Code Playgroud)\n输出相同,速度快 100-200 倍,比 @Onyambu 的基本 R 建议快约 3 倍(注意:该方法已更新为类似的速度,并且 data.table 解决方案甚至更快。使用 duckdb 或 arrow ,或者折叠起来进行连接可能会更快。但我的观察仍然是,通过将其视为连接,您可以获得显着的速度提高和易读性):
\nset.seed(42)\ntable = rep(1:1e5, each=10)\nx = sample(1:100, 1000, replace = TRUE)\nbench::mark(\n mymatch(x, table) |> as.vector(),\n join_match(x, table),\n unlist(split(seq(table), table)[as.character(x)],use.names = FALSE),\n check = TRUE\n)\n\n# A tibble: 3 \xc3\x97 13\n expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result\n <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>\n1 as.vector(mymatch(x, table)) 13.8s 13.8s 0.0727 14.9GB 2.83 1 39 13.8s <int> \n2 join_match(x, table) 48.7ms 62.2ms 13.8 88.8MB 3.95 7 2 506.3ms <int> \n3 unlist(split(seq(table), table)[as.character(x)], use.names = FALSE) 183.6ms 184.5ms 5.31 29.8MB 0 3 0 564.9ms <int> \nRun Code Online (Sandbox Code Playgroud)\n
也许 data.table 是一个选择?如果您有相对较大的表/向量,您可能会看到速度的提高,特别是如果您采用 Jon Spring 的“连接”方法:
library(tidyverse)
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:lubridate':
#>
#> hour, isoweek, mday, minute, month, quarter, second, wday, week,
#> yday, year
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
library(microbenchmark)
onyambu_faster <- function(x, table){
a <- rle(table)
n <- length(a$lengths)
idx <- match(x, a$values, 0)
sequence(a$lengths[idx], cumsum(c(1,a$lengths[-n]))[idx])
}
jon_spring_join_match = function(x, table) {
t <- data.frame(table, index = 1:length(table))
xt <- data.frame(x, index = 1:length(x))
t |>
left_join(xt, join_by(table == x), relationship = 'many-to-many') |>
arrange(index.y) %>%
filter(!is.na(index.y)) %>%
pull(index.x)
}
jared_mamrot_dt <- function(x, table){
table_dt <- data.table(table, index = 1:length(table))
x_dt <- data.table(x, index = 1:length(x))
return(na.omit(table_dt[x_dt, on = .(table == x)][,index]))
}
table = rep(1:1e5, each=10)
x = sample(1:100, 1000, replace = TRUE)
all.equal(onyambu_faster(x, table), jared_mamrot_dt(x, table))
#> [1] TRUE
all.equal(jon_spring_join_match(x, table), jared_mamrot_dt(x, table))
#> [1] TRUE
res <- microbenchmark(onyambu_faster(x, table),
jon_spring_join_match(x, table),
jared_mamrot_dt(x, table),
times = 10)
res
#> Unit: milliseconds
#> expr min lq mean median
#> onyambu_faster(x, table) 38.196317 45.08884 65.22651 52.40748
#> jon_spring_join_match(x, table) 48.697968 74.54407 105.79551 83.11473
#> jared_mamrot_dt(x, table) 9.441176 11.34315 12.99648 11.76324
#> uq max neval cld
#> 64.88688 129.38505 10 a
#> 131.50681 221.16477 10 b
#> 14.05289 21.84779 10 c
autoplot(res)
Run Code Online (Sandbox Code Playgroud)

创建于 2023 年 10 月 26 日,使用reprex v2.0.2
使用问题中的数据,根据我的机器上的中值时间,速度是原来的两倍。
table = rep(1:5, each=3)
x = c(2, 5, 2, 6)
mymatch = function(x, table) {
matches = sapply(x, \(xx) which(table %in% xx))
unlist(matches)
}
outer_match <- function(x, table) {
z1 <- outer(table, x, "==")
z2 <- z1 * row(z1)
z2[z2 != 0]
}
outer_match(x, table)
## [1] 4 5 6 13 14 15 4 5 6
library(microbenchmark)
microbenchmark(
mymatch(x, table),
outer_match(x, table)
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## mymatch(x, table) 77.0 79.15 166.696 82.75 84.3 8384.9 100 a
## outer_match(x, table) 35.1 36.75 115.783 41.95 43.1 7410.1 100 a
Run Code Online (Sandbox Code Playgroud)