Consider these two strings:
string1 <- "GCTCCC...CTCCATGAAGTA...CTTCACATCCGTGT.CCGGCCTGGCCGCGGAGAGCCC"
string_reference <- "GCTCCC...CTCCATGAAGTATTTCTTCACATCCGTGT.CCGGCCTGGCCGCGGAGAGCCC"
How can I easily remove the dots in "string1", but only those that sit at the same position as a dot in "string_reference"?
Expected output:
string1 = "GCTCCCCTCCATGAAGTA...CTTCACATCCGTGTCCGGCCTGGCCGCGGAGAGCCC"
I would just use R's truly vectorised subsetting and logical comparison...
# Split the strings
x <- strsplit( c( string1 , string_reference ) , "" )
# Compare and remove dots from string1 when dots also appear in the reference string at the same position
paste( x[[1]][ ! (x[[2]]== "." & x[[1]] == ".") ] , collapse = "" )
#[1] "GCTCCCCTCCATGAAGTA...CTTCACATCCGTGTCCGGCCTGGCCGCGGAGAGCCC"
Similar to Robert's, but a "vectorized" version:
s1 <- unlist(strsplit(string1, ""))
s2 <- unlist(strsplit(string_reference, ""))
paste0(Filter(Negate(is.na), ifelse(s1 == s2 & s1 == ".", NA, s1)), collapse="")
# [1] "GCTCCCCTCCATGAAGTA...CTTCACATCCGTGTCCGGCCTGGCCGCGGAGAGCCC"
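I put "vectorized" in quotes because the vectorization happens over the characters of a single string. This assumes your character vector has only one element. If it has more than one element, you have to loop over the result of strsplit.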
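One way to do that looping is mapply, which walks the two strsplit results in parallel. This is a minimal sketch, assuming the strings are paired element by element and that each pair has the same number of characters; the helper name remove_shared_dots and the two-element example vectors are made up for illustration:

```r
# Hypothetical helper: drop the dots in `x` that line up with dots in `ref`.
# Assumes x and ref are character vectors of equal length and that each
# pair of strings has the same number of characters.
remove_shared_dots <- function(x, ref) {
  stopifnot(length(x) == length(ref), all(nchar(x) == nchar(ref)))
  mapply(
    # a and b are the single-character vectors of one string pair
    function(a, b) paste(a[!(a == "." & b == ".")], collapse = ""),
    strsplit(x, "", fixed = TRUE),
    strsplit(ref, "", fixed = TRUE),
    USE.NAMES = FALSE
  )
}

# Made-up example input with two elements
strings    <- c("AC..GT.A", "..TT")
references <- c("AC.TGT.A", ".ATT")
remove_shared_dots(strings, references)
# [1] "AC.GTA" ".TT"
```

The stopifnot check is there because `==` recycles the shorter vector when the lengths differ, which would compare the wrong positions.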
Using intersect to find the overlapping dots:
cutpos <- do.call(intersect,
sapply(list(string_reference,string1), gregexpr, pattern=".", fixed=TRUE)
)
paste(strsplit(string1,"",fixed=TRUE)[[1]][-cutpos],collapse="")
#[1] "GCTCCCCTCCATGAAGTA...CTTCACATCCGTGTCCGGCCTGGCCGCGGAGAGCCC"
A small variation on the above (courtesy of @Arun):
attr(cutpos, 'match.length') <- rep(1L, length(cutpos))
attr(cutpos, 'useBytes') <- TRUE
do.call(paste0, c(regmatches(string1, list(cutpos), invert=TRUE), collapse=""))
## [1] "GCTCCCCTCCATGAAGTA...CTTCACATCCGTGTCCGGCCTGGCCGCGGAGAGCCC"
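One caveat with the negative-index form (`[-cutpos]`): it relies on there being at least one shared dot. If cutpos is empty, `x[-cutpos]` selects nothing rather than everything, and gregexpr reports "no match" as -1, which would turn into a positive index. The following is only a sketch of how a guard might look; the helper name remove_shared_dots_idx and the short test strings are hypothetical:

```r
# Hypothetical helper wrapping the intersect() approach with a guard
# for the case where the two strings share no dot position.
remove_shared_dots_idx <- function(x, ref) {
  pos_x   <- gregexpr(".", x,   fixed = TRUE)[[1]]
  pos_ref <- gregexpr(".", ref, fixed = TRUE)[[1]]
  cutpos  <- intersect(pos_x, pos_ref)
  cutpos  <- cutpos[cutpos > 0]        # gregexpr() reports "no match" as -1
  if (length(cutpos) == 0) return(x)   # nothing shared, so nothing to remove
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  paste(chars[-cutpos], collapse = "")
}

remove_shared_dots_idx("A.C.", "A.CT")    # shared dot at position 2
# [1] "AC."
remove_shared_dots_idx("AC.GT", "ACTG.")  # dots never line up
# [1] "AC.GT"
remove_shared_dots_idx("ACGT", "ACGT")    # no dots at all
# [1] "ACGT"
```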