Compare every n-th symbol of a text string

Lio*_*nir 6 r

The problem is that I have a large text file. Let it be

 a=c("atcgatcgatcgatcgatcgatcgatcgatcgatcg")

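I need to compare every 3rd symbol in this text with a value (e.g. 'c') and, if it matches, add 1 to a counter i. I thought of using grep, but it seems this function doesn't suit my purpose. So I need your help or advice.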

Beyond that, I would like to extract certain values from this string into a vector. For example, I want to extract symbols 4:10, e.g.

 a=c("atcgatcgatcgatcgatcgatcgatcgatcgatcg")
[1] "gatcgat"

Thank you in advance.

PS

I know that R is not the best choice for writing the script I need, but I am curious whether it can be written in a proper way.

Jos*_*ien 7

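Edited to give a fast solution for large strings:

If you have a very long string (on the order of millions of nucleotides), the lookbehind assertion in my original answer (below) is too slow to be practical. In that case, use something more like the following, which: (1) splits the string apart between every character; (2) uses the characters to fill a three-row matrix; and then (3) extracts the characters in the 3rd row of the matrix. This takes on the order of 0.2 seconds to process a 3-million character long string.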

## Make a 3-million character long string
a <- paste0(sample(c("a", "t", "c", "g"), 3e6, replace=TRUE), collapse="")

## Extract the third nucleotide of each codon (triplet)
n3  <- matrix(strsplit(a, "")[[1]], nrow=3)[3,]

## Check that it works (exact counts will vary, since no seed is set)
sum(n3=="c")
# [1] 250431
table(n3)
#  n3
#      a      c      g      t 
# 250549 250431 249008 250012 
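The matrix approach above relies on the string length being a multiple of 3; otherwise `matrix()` warns and recycles characters into the last column. A minimal sketch of an index-based variant that sidesteps this, assuming a hypothetical helper named `count_every_nth` (the name and its arguments are illustrative, not part of the original answer):

```r
## Hypothetical helper (name and arguments are illustrative only):
## count how often every n-th character of `s` equals `target`.
count_every_nth <- function(s, n = 3, target = "c") {
  chars <- strsplit(s, "")[[1]]               # one element per character
  idx   <- n * seq_len(length(chars) %/% n)   # positions n, 2n, 3n, ...
  sum(chars[idx] == target)
}

a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"
count_every_nth(a)        # every 3rd character compared with "c"
# [1] 3
count_every_nth("atcga")  # the trailing incomplete triplet is ignored
# [1] 1
```

Indexing with `seq_len()` rather than `seq(3, length(chars), by = 3)` keeps the helper from failing on strings shorter than `n`.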

Original answer:

I might use substr() in both cases.

## Split into codons. (The "lookbehind assertion", "(?<=.{3})" matches at each
## inter-character location that's preceded by three characters of  any type.)
codons <- strsplit(a, "(?<=.{3})", perl=TRUE)[[1]]
#  [1] "atc" "gat" "cga" "tcg" "atc" "gat" "cga" "tcg" "atc" "gat" "cga" "tcg"

## Extract 3rd nucleotide in each codon
n3 <- sapply(codons, function(X) substr(X,3,3))
# atc gat cga tcg atc gat cga tcg atc gat cga tcg 
# "c" "t" "a" "g" "c" "t" "a" "g" "c" "t" "a" "g" 

## Count the number of 'c's
sum(n3=="c")
# [1] 3


## Extract nucleotides 4-10
substr(a, 4,10)
# [1] "gatcgat"
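`substr()` returns the requested range as a single string. Since the question asks for the values in a vector, here is a hedged sketch of two ways to get one element per character; the variable names `sub_chars` and `sub_chars2` are only illustrative:

```r
a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"

## Split the extracted substring into single characters
sub_chars <- strsplit(substr(a, 4, 10), "")[[1]]
sub_chars
# [1] "g" "a" "t" "c" "g" "a" "t"

## Same result with the vectorised substring(): one start/stop pair per position
sub_chars2 <- substring(a, 4:10, 4:10)
identical(sub_chars, sub_chars2)
# [1] TRUE
```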

  • Of course, if you're going to do much "real work" with genomic data, have a look at the [Bioconductor project](http://www.bioconductor.org/) (3 upvotes)

Sve*_*ein 1

Compare every third character with "c":

grepl("^(.{2}c)*.{0,2}$", a)
# [1] FALSE
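Note that this pattern asks whether every third character is a "c", so it yields a single TRUE/FALSE rather than a count. A small illustration on made-up strings (the test strings and the name `pat` are assumptions for this example only):

```r
pat <- "^(.{2}c)*.{0,2}$"

grepl(pat, "atcggcttc")  # 3rd, 6th and 9th characters are all "c"
# [1] TRUE
grepl(pat, "atcgat")     # 6th character is "t"
# [1] FALSE
grepl(pat, "atcgg")      # an incomplete trailing triplet is allowed by .{0,2}
# [1] TRUE
```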

Extract characters 4 to 10:

substr(a, 4, 10)
# [1] "gatcgat"