Ada*_*amm 5 string r bioinformatics overlap dna-sequence
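Do you know of any ready-made way to get the overlap of two strings and its length? It has to be in R, though; maybe something from stringr? I have been looking here, unfortunately without success.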
str1 <- 'ABCDE'
str2 <- 'CDEFG'
str_overlap(str1, str2)
'CDE'
str_overlap_len(str1, str2)
3
Other examples:
str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'
str_overlap(str1, str2)
'CCTG'
str_overlap_len(str1, str2)
4
///
str1 <- 'foobarandfoo'
str2 <- 'barand'
str_overlap(str1, str2)
'barand'
str_overlap_len(str1, str2)
6
/// Yes, two solutions are possible here; always choose the overlap
str1 <- 'EFGABCDE'
str2 <- 'ABCDECDE'
str_overlap(str1, str2)
'ABCDE'
str_overlap_len(str1, str2)
5
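I wonder whether there is a small home-made function for this, something like the above?
It seems to me that you (OP) are less concerned with the performance of the code and more interested in a possible way to solve this without a ready-made function. Here is an example I came up with that computes the longest common substring. I should note that it only reports the first longest common substring it finds, even when several substrings of the same length exist. You can modify it to suit your needs. Please don't expect it to be super fast, because it won't be.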
foo <- function(str1, str2, ignore.case = FALSE, verbose = FALSE) {
  if (ignore.case) {
    str1 <- tolower(str1)
    str2 <- tolower(str2)
  }
  # make str2 the shorter string; its substrings are the candidates
  if (nchar(str1) < nchar(str2)) {
    x <- str2
    str2 <- str1
    str1 <- x
  }
  x <- strsplit(str2, "")[[1L]]
  n <- length(x)
  # index vectors 1:1, 1:2, ..., 1:n
  s <- sequence(seq_len(n))
  s <- split(s, cumsum(s == 1L))
  s <- rep(list(s), n)
  # shift the index vectors to every start position i and cut them off at n
  for (i in seq_along(s)) {
    s[[i]] <- lapply(s[[i]], function(x) {
      x <- x + (i - 1L)
      x[x <= n]
    })
    s[[i]] <- unique(s[[i]])
  }
  s <- unlist(s, recursive = FALSE)
  # try the longest candidates first
  s <- unique(s[order(-lengths(s))])
  i <- 1L
  len_s <- length(s)
  # <= so that the last candidate is checked as well
  while (i <= len_s) {
    lcs <- paste(x[s[[i]]], collapse = "")
    if (verbose) cat("now checking:", lcs, "\n")
    check <- grepl(lcs, str1, fixed = TRUE)
    if (check) {
      cat("the (first) longest common substring is:", lcs, "of length", nchar(lcs), "\n")
      break
    } else {
      i <- i + 1L
    }
  }
}
str1 <- 'ABCDE'
str2 <- 'CDEFG'
foo(str1, str2)
# the (first) longest common substring is: CDE of length 3
str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'
foo(str1, str2)
# the (first) longest common substring is: CCTG of length 4
str1 <- 'foobarandfoo'
str2 <- 'barand'
foo(str1, str2)
# the (first) longest common substring is: barand of length 6
str1 <- 'EFGABCDE'
str2 <- 'ABCDECDE'
foo(str1, str2)
# the (first) longest common substring is: ABCDE of length 5
set.seed(2018)
str1 <- paste(sample(c(LETTERS, letters), 500, TRUE), collapse = "")
str2 <- paste(sample(c(LETTERS, letters), 250, TRUE), collapse = "")
foo(str1, str2, ignore.case = TRUE)
# the (first) longest common substring is: oba of length 3
foo(str1, str2, ignore.case = FALSE)
# the (first) longest common substring is: Vh of length 2
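As noted above, the function only reports the first hit and prints it rather than returning it. One possible modification is sketched below. It is not the code above but a different technique, a row-by-row dynamic programming scan, and it returns every longest common substring as a character vector. The name `longest_common_substrings` is made up for this sketch, and `str_overlap` / `str_overlap_len` are simply the names used in the question; neither is a stringr function.

```r
# Hypothetical helper: all longest common substrings of a and b.
longest_common_substrings <- function(a, b) {
  x <- strsplit(a, "")[[1L]]
  y <- strsplit(b, "")[[1L]]
  if (length(x) == 0L || length(y) == 0L) return(character(0))
  # prev[j + 1] is the length of the common suffix of x[1:(i - 1)] and y[1:j]
  prev <- integer(length(y) + 1L)
  best <- 0L
  ends <- integer(0)  # end positions in `a` of the best matches so far
  for (i in seq_along(x)) {
    cur <- integer(length(y) + 1L)
    hit <- which(y == x[i])
    # a match at (i, j) extends the match that ended at (i - 1, j - 1)
    cur[hit + 1L] <- prev[hit] + 1L
    m <- max(cur)
    if (m > best) {
      best <- m
      ends <- i
    } else if (m == best && best > 0L) {
      ends <- c(ends, i)
    }
    prev <- cur
  }
  if (best == 0L) return(character(0))
  unique(substring(a, ends - best + 1L, ends))
}

# Wrappers with the interface sketched in the question (names assumed).
# On ties, the match that ends first in str1 is returned.
str_overlap <- function(str1, str2) {
  hits <- longest_common_substrings(str1, str2)
  if (length(hits) == 0L) "" else hits[1L]
}
str_overlap_len <- function(str1, str2) nchar(str_overlap(str1, str2))

str_overlap('EFGABCDE', 'ABCDECDE')
str_overlap_len('ATTAGACCTG', 'CCTGCCGGAA')
```

The scan needs time proportional to `nchar(a) * nchar(b)` and keeps only two rows in memory, so it should cope with longer sequences better than enumerating every substring, though I have not benchmarked it.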