How do I split each column of a data frame into two columns?

mah*_*ood 4 loops r strsplit rbind

I have a data frame like this (4 rows and 5 columns):

Marker ind1 ind2 ind3 ind4
mark1             CT             TT             CT             TT
mark2             AG             AA             AG             AA
mark3             AC             AA             AC             AA
mark4             CT             TT             CT             TT

What I want to do is split each column (except the first) into two columns, so the output should look like this (4 rows and 9 columns):

Marker ind1 ind1 ind2 ind2 ind3 ind3 ind4 ind4
mark1             C T             T T             C T             T T
mark2             A G             A A             A G             A A
mark3             A C             A A             A C             A A
mark4             C T             T T             C T             T T

I know how to split a single column:

do.call(rbind,strsplit(test$JRP4RA6119.039, ""))

which gives:

      [,1] [,2]
 [1,] "C"  "T" 
 [2,] "A"  "G" 
 [3,] "A"  "C" 
 [4,] "C"  "T" 
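For readers without the original `test` object, here is a self-contained sketch of the same one-column split. The vector `genotypes` is a made-up stand-in that copies the values of the first genotype column in the example; it is not the asker's actual data.

```r
# Hypothetical stand-in for one genotype column (values copied from ind1 above)
genotypes <- c("CT", "AG", "AC", "CT")

# strsplit() with an empty pattern splits each string into single characters
# and returns a list holding one character vector per input string
pieces <- strsplit(genotypes, "")

# rbind-ing the list elements stacks them into a 4 x 2 character matrix
alleles <- do.call(rbind, pieces)
alleles
```

If the column is stored as a factor rather than as character, wrap it in `as.character()` first, since `strsplit()` only accepts character vectors.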

What I would like is to loop this over all the columns and collect the result in a single data frame.

Thanks in advance.

Cat*_*ath 5

This feels a bit convoluted, but:

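# Split every column except Marker into single characters, rbind each column's
# pieces into a two-column matrix, then cbind those matrices side by side
test_split <- data.frame(Marker = test$Marker,
                         do.call("cbind", lapply(apply(test[, -1], 2, strsplit, ""),
                                                 function(x) do.call("rbind", x))),
                         stringsAsFactors = FALSE)
# Name the new columns <original name>_1 and <original name>_2
colnames(test_split)[-1] <- paste(rep(colnames(test)[-1], each = 2), 1:2, sep = "_")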

test_split
#      Marker JRP4RA6119.039_1 JRP4RA6119.039_2 JRP4RA6124.029_1 JRP4RA6124.029_2 JRP4RA6133.051_1 JRP4RA6133.051_2 JRP4RA6125.009_1 JRP4RA6125.009_2
#1 s7e4419xxx                C                T                T                T                C                T                T                T
#2 s7e7001s01                A                G                A                A                A                G                A                A
#3 s7e3049xxx                A                C                A                A                A                C                A                A
#4 s7e4727xxx                C                T                T                T                C                T                T                T
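A possibly simpler base R variant, offered only as a sketch: when every value is known to be exactly two characters long, `substr()` can pick out each character directly. The helper name `split_two_chars` and the small `df` below are made up for illustration and do not come from the original answer or from any package.

```r
# Example data using the same values as the first columns of the question's table
df <- data.frame(
  Marker = c("mark1", "mark2", "mark3", "mark4"),
  ind1 = c("CT", "AG", "AC", "CT"),
  ind2 = c("TT", "AA", "AA", "TT"),
  stringsAsFactors = FALSE
)

# split_two_chars() is a hypothetical helper name, not from any package.
# Assumption: every value outside the key column is exactly two characters long.
split_two_chars <- function(d, key = 1) {
  cols <- names(d)[-key]
  halves <- lapply(cols, function(nm) {
    x <- as.character(d[[nm]])
    # First and second character of each string become separate columns
    out <- data.frame(substr(x, 1, 1), substr(x, 2, 2), stringsAsFactors = FALSE)
    names(out) <- paste(nm, 1:2, sep = "_")
    out
  })
  # Put the key column back in front of the split columns
  cbind(d[key], do.call(cbind, halves))
}

res <- split_two_chars(df)
res
```

This trades generality for readability: values that are shorter or longer than two characters would be silently truncated or padded with empty strings, so the `strsplit()` approach is the safer choice when lengths can vary.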


akr*_*run 5

You could also try cSplit_f from splitstackshape:

library(splitstackshape)
df1[-1] <- lapply(df1[-1], function(x)
        gsub('(?<=\\w)(?=\\w)', ',', x, perl=TRUE))
cSplit_f(df1, 2:ncol(df1), sep=',')
#   Marker ind1_1 ind1_2 ind2_1 ind2_2 ind3_1 ind3_2 ind4_1 ind4_2
#1:  mark1      C      T      T      T      C      T      T      T
#2:  mark2      A      G      A      A      A      G      A      A
#3:  mark3      A      C      A      A      A      C      A      A
#4:  mark4      C      T      T      T      C      T      T      T
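The `gsub()` step above is what gives `cSplit_f` a delimiter to split on. As a small base R illustration (the vector `x` simply reuses the `ind1` values and is only an example), the pattern matches the empty position between two word characters and puts a comma there:

```r
x <- c("CT", "AG", "AC", "CT")

# (?<=\\w) : lookbehind, the position must be preceded by a word character
# (?=\\w)  : lookahead, the position must be followed by a word character
# Both are zero-width, so the comma is inserted without consuming any letter.
# perl = TRUE is needed because lookarounds are PCRE syntax.
with_sep <- gsub("(?<=\\w)(?=\\w)", ",", x, perl = TRUE)
with_sep
# "C,T" "A,G" "A,C" "C,T"
```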

Or, as @Ananda Mahto suggested, cSplit may be more efficient for large datasets, and it can be used directly without changing the delimiter.

cSplit(df1, names(df1)[-1], sep="", stripWhite = FALSE)
#   Marker ind1_1 ind1_2 ind2_1 ind2_2 ind3_1 ind3_2 ind4_1 ind4_2
#1:  mark1      C      T      T      T      C      T      T      T
#2:  mark2      A      G      A      A      A      G      A      A
#3:  mark3      A      C      A      A      A      C      A      A
#4:  mark4      C      T      T      T      C      T      T      T

Or use tstrsplit from data.table:

library(data.table)  # v1.9.5+
setDT(df1)
cbind(Marker=df1$Marker, df1[, unlist(lapply(.SD, function(x)
        tstrsplit(x, '')), recursive=FALSE), .SDcols=-1])
#   Marker ind11 ind12 ind21 ind22 ind31 ind32 ind41 ind42
#1:  mark1     C     T     T     T     C     T     T     T
#2:  mark2     A     G     A     A     A     G     A     A
#3:  mark3     A     C     A     A     A     C     A     A
#4:  mark4     C     T     T     T     C     T     T     T
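To see where column names such as ind11 and ind12 come from, here is a base R sketch that does not need data.table. The function `transposed_split` is a hypothetical stand-in for what `tstrsplit(x, '')` does conceptually (split, then transpose) and assumes two-character strings; the names are produced by `unlist(..., recursive = FALSE)`, which appends an index to the outer list name.

```r
# Same values as two of the genotype columns in df1
cols <- list(ind1 = c("CT", "AG", "AC", "CT"),
             ind2 = c("TT", "AA", "AA", "TT"))

# Hypothetical base R stand-in for tstrsplit(x, ""): split each string, then
# collect the i-th character of every string (assumes exactly two characters)
transposed_split <- function(x) {
  parts <- strsplit(x, "")
  lapply(1:2, function(i) vapply(parts, `[`, character(1), i))
}

nested <- lapply(cols, transposed_split)   # named list of unnamed 2-element lists
flat <- unlist(nested, recursive = FALSE)  # flatten one level only

names(flat)
# "ind11" "ind12" "ind21" "ind22"
```

If names like ind1_1 are preferred, they can be assigned afterwards, for example with `paste(rep(names(cols), each = 2), 1:2, sep = "_")`.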

Data

df1 <- structure(list(Marker = c("mark1", "mark2", "mark3", "mark4"), 
ind1 = c("CT", "AG", "AC", "CT"), ind2 = c("TT", "AA", "AA", 
"TT"), ind3 = c("CT", "AG", "AC", "CT"), ind4 = c("TT", "AA", 
"AA", "TT")), .Names = c("Marker", "ind1", "ind2", "ind3", 
"ind4"), class = "data.frame", row.names = c(NA, -4L))

  • @akrun, no, it won't. That's why I suggested `cSplit` :-) (4 upvotes)
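  • @SamFirke, I would actually just recommend `cSplit(df1, names(df1)[-1], "", stripWhite = FALSE)`, since there may be memory problems with very large datasets when using `cSplit_f` [...].table" preallocates columns. (3 upvotes)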