Break a string into multiple strings on separate lines

Sam*_*bus 2 r character dataframe

I have a data frame that contains one long string per row, each associated with a 'Sample':

Sample  Data
  1     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
  2     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N

I would like a simple way to break this string into 5 pieces, in the following format:

Sample X
CCT6 - Characters 1-33
GAT1 - Characters 34-68
IMD3 - Characters 69-99
PDR3 - Characters 100-130
RIM15 - Characters 131-168

This should give output like the following for each sample:

Sample 1
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N

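I have been able to use the substr function to break the long string into individual pieces, but I'd like to automate this so that I get all 5 pieces in one output. Ideally, that output would also be a data frame.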

mds*_*ner 5

This is what ?read.fwf is for.

First, some data that looks like the data in your question:

x <- data.frame(Sample = c(1, 2), Data = c("000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N", 
"000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"), 
stringsAsFactors = FALSE)

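Now use read.fwf, specifying the width and name of each field, and that all fields should be of mode character. The widths follow from the character ranges in the question (for example, characters 34-68 span 35 characters). We wrap the text column of the example data in textConnection so that it can be treated as a connection, which is what read.* and similar functions normally understand.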

# The outer parentheses print the result as well as assigning it
(strs <- read.fwf(textConnection(x$Data),
                  widths = c(33, 35, 31, 31, 38),
                  colClasses = "character",
                  col.names = c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15")))


                               CCT6                                GAT1                            IMD3                            PDR3                                  RIM15
1 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N
2 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N
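Since the question started from substr, here is a minimal sketch of the same split using base R's vectorised substring(), for comparison. The toy string and the names toy, toy_df, starts, ends, genes and strs2 are assumptions made up for illustration; only the character ranges come from the question.

```r
# Toy data (an assumption for illustration): 168 characters,
# one repeated letter per segment so the split is easy to check by eye.
toy <- paste0(strrep("A", 33), strrep("B", 35), strrep("C", 31),
              strrep("D", 31), strrep("E", 38))
toy_df <- data.frame(Sample = c(1, 2), Data = c(toy, toy),
                     stringsAsFactors = FALSE)

# Start and end positions taken from the question.
starts <- c(1, 34, 69, 100, 131)
ends   <- c(33, 68, 99, 130, 168)
genes  <- c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15")

# The read.fwf widths are just end - start + 1.
widths <- ends - starts + 1

# substring() is vectorised over first/last, so one call per string
# returns all five pieces at once.
pieces <- lapply(toy_df$Data, substring, first = starts, last = ends)

# Stack the pieces into one row per sample.
strs2 <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(strs2) <- genes
```

With the real data, x$Data would take the place of toy_df$Data. read.fwf is probably still the shorter route; this variant may be handy when the positions are already stored as start/end pairs.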

Now loop over the rows and print each one in the format from your example:

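for (i in seq_len(nrow(strs))) {   # seq_len() is safe even when there are zero rows
  # Header line, using the sample ID stored in x rather than the row index
  writeLines(paste("Sample", x$Sample[i]))
  # One "NAME - segment" line per field
  writeLines(paste(names(strs), unlist(strs[i, ]), sep = " - "))
}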

For example, the output for sample 2:

Sample 2
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N
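
The question also asks for the result as a data frame. As a rough sketch, the sample IDs can be bound back onto the split columns, and the result reshaped to one row per sample and segment if that layout is preferred. The objects x_toy and strs_toy are small stand-ins invented here so that the block runs on its own, and the column names Gene and Segment are likewise assumptions.

```r
# Toy stand-ins (assumptions): a shortened version of x and of the
# read.fwf result, with two samples and two fields.
x_toy <- data.frame(Sample = c(1, 2))
strs_toy <- data.frame(CCT6 = c("AAA", "aaa"),
                       GAT1 = c("BBB", "bbb"),
                       stringsAsFactors = FALSE)

# Wide layout: one row per sample, one column per field.
wide <- cbind(Sample = x_toy$Sample, strs_toy)

# Long layout: one row per sample/field pair.
# unlist() walks the data frame column by column, so Sample is
# recycled per column and each field name is repeated once per row.
long <- data.frame(
  Sample  = rep(wide$Sample, times = ncol(strs_toy)),
  Gene    = rep(names(strs_toy), each = nrow(strs_toy)),
  Segment = unlist(strs_toy, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Group the rows by sample, mirroring the printed output.
long <- long[order(long$Sample), ]
```

With the real objects, x and strs would replace x_toy and strs_toy.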