Fastest way to check matching positions of two equal-length sequences in R

sca*_*der 5 string r dplyr tidyverse

I have the following function, which computes the fraction of positions at which two strings match:

library(tidyverse)
calculate_sequ_identity <- function(sequ_1 = NULL, sequ_2 = NULL) {
  # The two sequences must be of the same length, e.g.:
  # sequ_1 <- "FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH" 
  # sequ_2 <- "FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS" 

  
  # Fail immediately on unequal lengths (wrapping stop() in try() would swallow the error)
  if (nchar(sequ_1) != nchar(sequ_2)) stop("sequences are of different length")

  
  seq1_dat <- as_tibble(unlist(str_split(string = sequ_1, pattern = ""))) %>%
    dplyr::rename(res = 1) %>%
    dplyr::rename(res.seq1 = 1)
  
  seq2_dat <- as_tibble(unlist(str_split(string = sequ_2, pattern = ""))) %>%
    dplyr::rename(res = 1) %>%
    dplyr::rename(res.seq2 = 1)
  
  final_dat <- bind_cols(seq1_dat, seq2_dat) %>%
    rowwise() %>%
    mutate(identity_status = if_else(res.seq1 == res.seq2, 1, 0)) %>%
    unnest(cols = c()) %>%
    mutate(res_no = row_number()) 
  
  identity <- sum(final_dat$identity_status) / nchar(sequ_1)
  
  identity
}

Take this example:

 sequ_1: FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH
         ||| || |              |           # 7 matches of 30 char seqs
 sequ_2: FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS

The sequence identity is 7/30 ≈ 0.23.
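As a sanity check of that count, here is a minimal base R sketch that lists the matching positions for this one pair. It is not part of the original function, and the variable names `chars_1`, `chars_2`, `matches`, `match_positions`, `n_matches` and `identity` are chosen here purely for illustration.

```r
# Example sequences from the question
sequ_1 <- "FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH"
sequ_2 <- "FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS"

# Split each string into single characters
chars_1 <- strsplit(sequ_1, "")[[1]]
chars_2 <- strsplit(sequ_2, "")[[1]]

# Element-wise comparison gives one TRUE/FALSE per position
matches <- chars_1 == chars_2

match_positions <- which(matches)  # positions 1, 2, 3, 5, 6, 8 and 23
n_matches <- sum(matches)          # 7 matching positions
identity <- mean(matches)          # 7 / 30
```

The positions agree with the `|` markers in the alignment above.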

But I'm not sure this is the fastest routine. Please advise on a faster way to compute this. I typically have millions of pairs to check.

Current benchmark:

rbenchmark::benchmark(

  "m1" = {calculate_sequ_identity(sequ_1 = sequ_1, sequ_2 = sequ_2)},
  replications = 100,
  columns = c("test", "replications", "elapsed",
              "relative", "user.self", "sys.self")
  
)

which gives me

  test replications elapsed relative user.self sys.self
1   m1          100   2.267    1.000     2.228    0.032

jay*_*.sf 2

How about this?

f <- \(x, y) {
  stopifnot(nchar(x) == nchar(y))
  matrixStats::colMeans2(mapply(`==`, strsplit(x, ''), strsplit(y, '')))
}

s1 <- c('FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH', 'FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH')
s2 <- c('FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS', 'FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS')
f(s1, s2)
# [1] 0.2333333 0.2333333
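If installing matrixStats is not an option, the same idea can be sketched in base R. This is a hedged variant, not the answerer's code: the function name `sequ_identity` is hypothetical, and it swaps `strsplit()` for `utf8ToInt()`, which compares Unicode code points. That is assumed to be equivalent to comparing characters for plain ASCII sequences such as the ones in the question. Because each pair is processed separately, pairs may have different lengths from one another, as long as the two strings within each pair are equally long. Whether it is faster on your data is something to benchmark rather than assume.

```r
# Hypothetical helper: fraction of identical positions for each pair (x[i], y[i]).
# Assumes plain ASCII sequences; utf8ToInt() compares code points.
sequ_identity <- function(x, y) {
  stopifnot(
    is.character(x), is.character(y),
    length(x) == length(y),
    all(nchar(x) == nchar(y))  # each pair must be of equal length
  )
  # vapply() guarantees exactly one numeric value per pair
  vapply(seq_along(x), function(i) {
    mean(utf8ToInt(x[i]) == utf8ToInt(y[i]))
  }, numeric(1))
}

s1 <- c("FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH", "ACGT")
s2 <- c("FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS", "ACGA")
res <- sequ_identity(s1, s2)
res  # 7/30 for the first pair, 3/4 for the second
```

Note that an empty pair (`""` against `""`) yields `NaN`, because there are no positions to average over.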