sca*_*der 5 string r dplyr tidyverse
I have the following function, which computes the percentage of matching positions between two strings:
library(tidyverse)
calculate_sequ_identity <- function(sequ_1 = NULL, sequ_2 = NULL) {
  # # Two sequs must be of the same length
  # sequ_1 <- "FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH"
  # sequ_2 <- "FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS"
  try(if (nchar(sequ_1) != nchar(sequ_2)) stop("sequ of different length"))
  seq1_dat <- as_tibble(unlist(str_split(string = sequ_1, pattern = ""))) %>%
    dplyr::rename(res = 1) %>%
    dplyr::rename(res.seq1 = 1)
  seq2_dat <- as_tibble(unlist(str_split(string = sequ_2, pattern = ""))) %>%
    dplyr::rename(res = 1) %>%
    dplyr::rename(res.seq2 = 1)
  final_dat <- bind_cols(seq1_dat, seq2_dat) %>%
    rowwise() %>%
    mutate(identity_status = if_else(res.seq1 == res.seq2, 1, 0)) %>%
    unnest(cols = c()) %>%
    mutate(res_no = row_number())
  identity <- sum(final_dat$identity_status) / nchar(sequ_1)
  identity
}
Take this as an example:
sequ_1: FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH
        ||| || |              |        # 7 matches of 30 char seqs
sequ_2: FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS
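The sequence identity is 7/30 ≈ 0.23.
But I'm not sure whether this is the fastest routine. Please advise on a faster way to compute it. I typically have millions of pairs to check.
Current benchmark: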
rbenchmark::benchmark(
  "m1" = {calculate_sequ_identity(sequ_1 = sequ_1, sequ_2 = sequ_2)},
  replications = 100,
  columns = c("test", "replications", "elapsed",
              "relative", "user.self", "sys.self")
)
This gives me:
  test replications elapsed relative user.self sys.self
1   m1          100   2.267    1.000     2.228    0.032
How about this?
f <- \(x, y) {
  stopifnot(nchar(x) == nchar(y))
  matrixStats::colMeans2(mapply(`==`, strsplit(x, ''), strsplit(y, '')))
}
s1 <- c('FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH', 'FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH')
s2 <- c('FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS', 'FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS')
f(s1, s2)
# [1] 0.2333333 0.2333333
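If the sequences are plain ASCII, as the amino-acid strings here appear to be, a base-R variant could skip `strsplit()` and compare raw bytes instead. The following is only a sketch under that assumption: `seq_identity` is a hypothetical name, and it has not been benchmarked against the functions above. Unlike the matrix-based version, it should also cope with pairs whose lengths differ from one pair to the next, as long as the two strings within each pair have equal length.

```r
# Hypothetical helper: fraction of matching positions for each pair of strings.
# Assumes single-byte (ASCII) characters, so one byte corresponds to one residue.
seq_identity <- function(x, y) {
  stopifnot(length(x) == length(y),
            nchar(x, type = "bytes") == nchar(y, type = "bytes"))
  # charToRaw() handles one string at a time, so iterate over the pairs
  vapply(seq_along(x), function(i) {
    mean(charToRaw(x[i]) == charToRaw(y[i]))
  }, numeric(1))
}

s1 <- c("FKDHKHIDVKDRHRTRHLAKTRCYHIDPHH", "ACGT")
s2 <- c("FKDKKHLDKFSSYHVKTAFFHVCTQNPQDS", "ACGA")
seq_identity(s1, s2)
# [1] 0.2333333 0.7500000
```

With millions of pairs, it is probably worth timing this against `f()` on a realistic sample before committing to either one.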