我正在计算来自多个序列比对的单倍型,并且正在获得一系列重复序列,例如RNNNNNNNT和RNNNT.有许多变化,使得很难理解数据.
数据如下所示,我是否有兴趣基于haplotypes_1生成列haplotypes_2,如下所示:
hap_code haplotypes_1 haplotypes_2
1 SKNNNRNNNNNKNNNNNNNKF SK(N3)R(N5)K(N7)KF
2 SKNNNNNNNNNKNNNNNNNNKF SK(N9)K(N8)KF
3 SKNNNNNNNNNNNNNNNNKF SK(N16)KF
Run Code Online (Sandbox Code Playgroud)
使用stringr和自定义功能:
library(stringr)
replace_string <- function(x) {
sprintf("(%s%i)", str_sub(x, end = 1L), str_length(x))
}
df1$hapnew <- str_replace_all(df1$haplotypes_1, "N+", replace_string)
hap_code haplotypes_1 haplotypes_2 hapnew
1 1 SKNNNRNNNNNKNNNNNNNKF SK(N3)R(N5)K(N7)KF SK(N3)R(N5)K(N7)KF
2 2 SKNNNNNNNNNKNNNNNNNNKF SK(N9)K(N8)KF SK(N9)K(N8)KF
3 3 SKNNNNNNNNNNNNNNNNKF SK(N16)KF SK(N16)KF
Run Code Online (Sandbox Code Playgroud)
x = c("SKNNNRNNNNNKNNNNNNNKF", "SKNNNNNNNNNKNNNNNNNNKF", "SKNNNNNNNNNNNNNNNNKF")
sapply(strsplit(x, ""), function(mystr)
with(rle(mystr),
paste(paste0(ifelse(lengths > 1, paste0("(",values), values),
ifelse(lengths > 1, paste0(lengths,")"), ""),
collapse = ""))))
#[1] "SK(N3)R(N5)K(N7)KF" "SK(N9)K(N8)KF" "SK(N16)KF"
Run Code Online (Sandbox Code Playgroud)
几乎与 @db 完全相同,但转换为几个函数,因此它可重用且对读者友好:
abbreviate_letters <- function(type_letters) {
runs <- rle(type_letters)
run_codes <- ifelse(
runs[["lengths"]] == 1,
yes = runs[["values"]],
no = paste0("(", runs[["values"]], runs[["lengths"]], ")")
)
paste0(run_codes, collapse = "")
}
condense_haplotype <- function(haplotype_long) {
split_terms <- strsplit(haplotype_long, split = "")
vapply(
X = split_terms,
FUN = abbreviate_letters,
FUN.VALUE = character(1)
)
}
haplotypes <- c(
"SKNNNRNNNNNKNNNNNNNKF",
"SKNNNNNNNNNKNNNNNNNNKF",
"SKNNNNNNNNNNNNNNNNKF"
)
condense_haplotype(haplotypes)
# [1] "SK(N3)R(N5)K(N7)KF" "SK(N9)K(N8)KF" "SK(N16)KF"
Run Code Online (Sandbox Code Playgroud)