Find the length of the overlap between two strings

Ada*_*amm 5 string r bioinformatics overlap dna-sequence

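Do you know of any ready-to-use method to get the overlap of two strings and its length? It has to be in R, though; maybe something from stringr? I searched here, unfortunately without success.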

str1 <- 'ABCDE'
str2 <- 'CDEFG'

str_overlap(str1, str2)
'CDE'

str_overlap_len(str1, str2)
3

Other examples:

str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'

str_overlap(str1, str2)
'CCTG'

str_overlap_len(str1, str2)
4

///

str1 <- 'foobarandfoo'
str2 <- 'barand'

str_overlap(str1, str2)
'barand'

str_overlap_len(str1, str2)
6

/// Yes, there are two possible solutions here; always choose the longest overlap

str1 <- 'EFGABCDE'
str2 <- 'ABCDECDE'

str_overlap(str1, str2)
'ABCDE'

str_overlap_len(str1, str2)
5

I wonder whether there is a small homemade function for this, something like this one.

tal*_*lat 4

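It seems to me that you (the OP) are less concerned with the performance of the code and more interested in a possible way to solve this without a ready-made function. Here is an example I came up with that computes the longest common substring. Note that it returns only the first longest common substring it finds, even if there are several of the same length. You can modify it to fit your needs. Please do not expect it to be super fast; it won't be.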

foo <- function(str1, str2, ignore.case = FALSE, verbose = FALSE) {

  if(ignore.case) {
    str1 <- tolower(str1)
    str2 <- tolower(str2)
  }

  if(nchar(str1) < nchar(str2)) {
    x <- str2
    str2 <- str1
    str1 <- x
  }

  x <- strsplit(str2, "")[[1L]]
  n <- length(x)
  s <- sequence(seq_len(n))
  s <- split(s, cumsum(s == 1L))
  s <- rep(list(s), n)

  for(i in seq_along(s)) {
    s[[i]] <- lapply(s[[i]], function(x) {
      x <- x + (i-1L)
      x[x <= n]
    })
    s[[i]] <- unique(s[[i]])
  }

  s <- unlist(s, recursive = FALSE)
  s <- unique(s[order(-lengths(s))])

  i <- 1L
  len_s <- length(s)
  while(i <= len_s) {  # <= so that the last candidate is checked as well
    lcs <- paste(x[s[[i]]], collapse = "")
    if(verbose) cat("now checking:", lcs, "\n")
    check <- grepl(lcs, str1, fixed = TRUE)
    if(check) {
      cat("the (first) longest common substring is:", lcs, "of length", nchar(lcs), "\n")
      break
    } else {
      i <- i + 1L 
    }
  }
}

str1 <- 'ABCDE'
str2 <- 'CDEFG'
foo(str1, str2)
# the (first) longest common substring is: CDE of length 3 

str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'
foo(str1, str2)
# the (first) longest common substring is: CCTG of length 4

str1 <- 'foobarandfoo'
str2 <- 'barand'
foo(str1, str2)
# the (first) longest common substring is: barand of length 6 

str1 <- 'EFGABCDE'
str2 <- 'ABCDECDE'
foo(str1, str2)
# the (first) longest common substring is: ABCDE of length 5 


set.seed(2018)
str1 <- paste(sample(c(LETTERS, letters), 500, TRUE), collapse = "")
str2 <- paste(sample(c(LETTERS, letters), 250, TRUE), collapse = "")

foo(str1, str2, ignore.case = TRUE)
# the (first) longest common substring is: oba of length 3 

foo(str1, str2, ignore.case = FALSE)
# the (first) longest common substring is: Vh of length 2 
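
The question asks for functions that return the overlap and its length rather than print them. Below is a minimal sketch of that, assuming the same "first longest common substring" definition used by `foo` above. The names `str_overlap` and `str_overlap_len` are taken from the question and are hypothetical helpers, not functions from stringr or any other package. Like `foo`, it is a brute-force search and is not meant to be fast.

```r
# Hypothetical helper (name from the question, not from any package):
# returns the first longest common substring of two strings, or "" if none.
str_overlap <- function(str1, str2) {
  # Enumerate substrings of the shorter string to keep the candidate set small.
  if (nchar(str1) < nchar(str2)) {
    tmp <- str1
    str1 <- str2
    str2 <- tmp
  }
  n <- nchar(str2)

  # Try candidate lengths from longest to shortest and stop at the first hit.
  for (len in rev(seq_len(n))) {
    starts <- seq_len(n - len + 1L)
    candidates <- substring(str2, starts, starts + len - 1L)
    hit <- vapply(
      candidates,
      function(p) grepl(p, str1, fixed = TRUE),
      logical(1),
      USE.NAMES = FALSE
    )
    if (any(hit)) return(candidates[hit][1L])
  }
  ""  # no common substring at all
}

# Length of the overlap found by str_overlap().
str_overlap_len <- function(str1, str2) {
  nchar(str_overlap(str1, str2))
}

str_overlap('EFGABCDE', 'ABCDECDE')
str_overlap_len('EFGABCDE', 'ABCDECDE')
```

Because the functions return values instead of calling `cat()`, the results can be stored or applied over many pairs of sequences, for example with `mapply()`.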