R - 将一串重复氨基酸从例如NNNN转换为(N4)

K. *_*mae 8 regex r

我正在计算来自多个序列比对的单倍型,并且正在获得一系列重复序列,例如RNNNNNNNT和RNNNT.有许多变化,使得很难理解数据.

数据如下所示,我是否有兴趣基于haplotypes_1生成列haplotypes_2,如下所示:

hap_code    haplotypes_1          haplotypes_2
  1    SKNNNRNNNNNKNNNNNNNKF    SK(N3)R(N5)K(N7)KF
  2    SKNNNNNNNNNKNNNNNNNNKF   SK(N9)K(N8)KF
  3    SKNNNNNNNNNNNNNNNNKF     SK(N16)KF
Run Code Online (Sandbox Code Playgroud)

man*_*ark 7

使用stringr和自定义功能:

library(stringr)

replace_string <- function(x) {
  sprintf("(%s%i)", str_sub(x, end = 1L), str_length(x))
}

df1$hapnew <- str_replace_all(df1$haplotypes_1, "N+", replace_string)

  hap_code           haplotypes_1       haplotypes_2             hapnew
1        1  SKNNNRNNNNNKNNNNNNNKF SK(N3)R(N5)K(N7)KF SK(N3)R(N5)K(N7)KF
2        2 SKNNNNNNNNNKNNNNNNNNKF      SK(N9)K(N8)KF      SK(N9)K(N8)KF
3        3   SKNNNNNNNNNNNNNNNNKF          SK(N16)KF          SK(N16)KF
Run Code Online (Sandbox Code Playgroud)


d.b*_*d.b 5

x = c("SKNNNRNNNNNKNNNNNNNKF", "SKNNNNNNNNNKNNNNNNNNKF", "SKNNNNNNNNNNNNNNNNKF")
sapply(strsplit(x, ""), function(mystr)
    with(rle(mystr),
         paste(paste0(ifelse(lengths > 1, paste0("(",values), values),
                      ifelse(lengths > 1, paste0(lengths,")"), ""),
                      collapse = ""))))
#[1] "SK(N3)R(N5)K(N7)KF" "SK(N9)K(N8)KF"      "SK(N16)KF"        
Run Code Online (Sandbox Code Playgroud)


Nat*_*rth 1

几乎与 @db 完全相同,但转换为几个函数,因此它可重用且对读者友好:

abbreviate_letters <- function(type_letters) {
  runs <- rle(type_letters)
  run_codes <- ifelse(
    runs[["lengths"]] == 1,
    yes = runs[["values"]],
    no  = paste0("(", runs[["values"]], runs[["lengths"]], ")")
  )
  paste0(run_codes, collapse = "")
}


condense_haplotype <- function(haplotype_long) {
  split_terms <- strsplit(haplotype_long, split = "")
  vapply(
    X         = split_terms,
    FUN       = abbreviate_letters,
    FUN.VALUE = character(1)
  )
}


haplotypes <- c(
  "SKNNNRNNNNNKNNNNNNNKF",
  "SKNNNNNNNNNKNNNNNNNNKF",
  "SKNNNNNNNNNNNNNNNNKF"
)

condense_haplotype(haplotypes)
# [1] "SK(N3)R(N5)K(N7)KF" "SK(N9)K(N8)KF"      "SK(N16)KF"
Run Code Online (Sandbox Code Playgroud)