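This question simply asks for an R implementation of the following question: Find the longest common starting substring in a set of strings (JavaScript)
"This problem is a more specific case of the longest common substring problem. I only need to find the longest common starting substring in an array."
So I am just looking for an R implementation of this (preferably not the for/while loop approach suggested in the JavaScript version). If possible, I would like to wrap it up as a function so that I can apply it to many groups in a data table.
After some searching I could not find an R example, hence this question.
Example data: I have the following character vector: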
dput(data)
c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
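I would like to run an algorithm in R that produces the following output: ADA44.
From what I have seen in the accepted JavaScript answer, the idea is to first sort the vector, extract the first and last elements (for example "ADA4417-3ARMZ-R7" and "ADA4432-1BRJZ-R2"), split them into single characters, and loop through them until a pair of characters does not match (I hope I got that right).
Any help with this would be great!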
Taking inspiration from what you suggested, you can try this function:
comsub <- function(x) {
  # sort the vector
  x <- sort(x)
  # split the first and last element by character
  d_x <- strsplit(x[c(1, length(x))], "")
  # compare only up to the length of the shorter string (avoids recycling)
  n <- min(lengths(d_x))
  # compute the cumulative sum of common elements
  cs_x <- cumsum(d_x[[1]][seq_len(n)] == d_x[[2]][seq_len(n)])
  # check if there is at least one common element
  if (n > 0 && cs_x[1] != 0) {
    # see when it stops incrementing and get the position of last common element
    der_com <- which(diff(cs_x) == 0)[1]
    # no mismatch found: the whole shorter string is common
    if (is.na(der_com)) der_com <- n
    # return the common part
    return(substr(x[1], 1, der_com))
  } else { # else, return an empty vector
    return(character(0))
  }
}
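To make the cumulative-sum step concrete, here is a small trace of the intermediate values for the first and last elements of the example vector. The variable names (first_chars, last_chars, same, cs, prefix_len) are illustrative only; the snippet mirrors the idea of the function above and is not meant to replace it.

```r
# First and last elements of the sorted example vector, split into characters
first_chars <- strsplit("ADA4417-3ARMZ-R7", "")[[1]]
last_chars  <- strsplit("ADA4432-1BRJZ-R2", "")[[1]]

# Position-wise comparison (both strings have 16 characters here)
same <- first_chars == last_chars

# Running count of matches; it stops increasing at the first mismatch
cs <- cumsum(same)

# Index of the first position where the count does not increase,
# which equals the length of the common prefix
prefix_len <- which(diff(cs) == 0)[1]

prefix <- paste(first_chars[seq_len(prefix_len)], collapse = "")
print(cs)
print(prefix)  # "ADA44"
```

Note that later matches (such as the "-" in position 8) still increase the count, which is why the function looks for the first flat step rather than the final total.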
UPDATE
Following @nicola's suggestion, a simpler and more elegant variant of the function:
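comsub <- function(x) {
  # sort the vector
  x <- sort(x)
  # split the first and last element by character
  d_x <- strsplit(x[c(1, length(x))], "")
  # compare only up to the length of the shorter string (avoids recycling)
  n <- min(lengths(d_x))
  d_x <- lapply(d_x, head, n)
  # search for the first not common element and so, get the last matching one;
  # if every compared character matches, the whole shorter string is common
  der_com <- match(FALSE, do.call("==", d_x), nomatch = n + 1) - 1
  # if there is no matching element, return an empty vector, else return the common part
  if (der_com == 0) character(0) else substr(x[1], 1, der_com)
}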
Examples:
With your data:
x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
comsub(x)
#[1] "ADA44"
When there is no common starting substring:
x<-c("abc","def")
comsub(x)
# character(0)
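Since the question mentions applying this to many groups, here is a minimal base R sketch of that step. It assumes a hypothetical data frame parts with columns grp and part, and uses a hypothetical helper comsub_or_na that follows the same sort-and-compare idea but returns NA_character_ instead of character(0) when nothing is shared, so that every group yields exactly one value. Both the example data and that return convention are assumptions made for illustration.

```r
# Prefix finder based on the sort-and-compare idea; returns NA_character_
# (an assumption made here) when the strings share no starting substring
comsub_or_na <- function(x) {
  x <- sort(x)
  d_x <- strsplit(x[c(1, length(x))], "")
  n <- min(lengths(d_x))
  same <- d_x[[1]][seq_len(n)] == d_x[[2]][seq_len(n)]
  len <- match(FALSE, same, nomatch = n + 1) - 1
  if (len == 0) NA_character_ else substr(x[1], 1, len)
}

# Hypothetical example data: part numbers with a grouping column
parts <- data.frame(
  grp  = c("a", "a", "a", "b", "b"),
  part = c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4432-1BRJZ-R2",
           "LT1001CN8", "LT1013DN8"),
  stringsAsFactors = FALSE
)

# One prefix per group, returned as a named character vector
prefixes <- vapply(split(parts$part, parts$grp), comsub_or_na, character(1))
print(prefixes)
```

The same helper should also be usable inside a data.table call along the lines of DT[, .(prefix = comsub_or_na(part)), by = grp], although that form is not exercised here.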
A non-base alternative, using the lcprefix function in Biostrings to find the "longest common prefix [...] of two strings":
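# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biostrings")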
library(Biostrings)
x2 <- sort(x)
substr(x2[1], start = 1, stop = lcprefix(x2[1], x2[length(x2)]))
# [1] "ADA44"
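For comparison, here is a base R sketch that needs neither Biostrings nor sorting: a two-string helper plays the role of the pairwise prefix step, and Reduce folds it over the whole vector. The names lcp2 and lcp_all are hypothetical, and unlike comsub above this version returns an empty string "" (not character(0)) when there is no common prefix.

```r
# Longest common prefix of two strings (base R stand-in for the pairwise step)
lcp2 <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb))
  # position of the first mismatch within the overlapping part;
  # n + 1 when the shorter string is entirely a prefix of the longer one
  first_diff <- match(FALSE, ca[seq_len(n)] == cb[seq_len(n)], nomatch = n + 1)
  substr(a, 1, first_diff - 1)
}

# Fold the pairwise helper over the vector; the running prefix can only shrink
lcp_all <- function(x) Reduce(lcp2, x)

x <- c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
       "ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
print(lcp_all(x))                 # "ADA44"
print(lcp_all(c("abc", "def")))   # ""
```

Because no sorting is involved, the result does not depend on the collation order that sort() uses in the current locale.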