Split the string elements of each column within the rows of a data frame

Jul*_*uly 7 split r dataframe

I have a matrix like this (1000 x 2830):

        9178    3574    3547
160     B_B     B_B      A_A
301     B_B     A_B      A_B
303     B_B     B_B      A_A
311     A_B     A_B      A_A
312     B_B     A_B      A_A
314     B_B     A_B      A_A

I would like to obtain the following (duplicate the colnames and split each element of every column):

      9178   9178   3574   3574   3547   3547
160     B      B      B      B      A      A
301     B      B      A      B      A      B
303     B      B      B      B      A      A
311     A      B      A      B      A      A
312     B      B      A      B      A      A
314     B      B      A      B      A      A

I tried using strsplit, but I get an error message because this is a matrix, not a string. Could you suggest some ideas for solving this?

tal*_*lat 7

Here is an option using dplyr (for bind_cols) and tidyr (for separate_), together with lapply from base R. It assumes your data is a data.frame (i.e. you may need to convert it to a data.frame first):

library(dplyr)
library(tidyr)

lapply(names(df), function(x) separate_(df[x], x, paste0(x,"_",1:2), sep = "_" )) %>% 
  bind_cols
#  X9178_1 X9178_2 X3574_1 X3574_2 X3547_1 X3547_2
#1       B       B       B       B       A       A
#2       B       B       A       B       A       B
#3       B       B       B       B       A       A
#4       A       B       A       B       A       A
#5       B       B       A       B       A       A
#6       B       B       A       B       A       A
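
For comparison, the same reshaping can be sketched in base R alone, working directly on the character matrix from the question. This is only a minimal sketch: the helper name split_cols and the three-row sample matrix m are made up for illustration, and it assumes every cell contains exactly one "_" separator.

```r
# Hypothetical sample: first rows/columns of the matrix shown in the question
m <- matrix(c("B_B", "B_B", "A_B",
              "B_B", "A_B", "A_B",
              "A_A", "A_B", "A_A"),
            nrow = 3,
            dimnames = list(c("160", "301", "311"),
                            c("9178", "3574", "3547")))

# split_cols is an illustrative helper, not part of any package.
# Assumes each cell splits into exactly two pieces.
split_cols <- function(m, sep = "_") {
  pieces <- lapply(seq_len(ncol(m)), function(j) {
    # strsplit works on a character vector, so take one column at a time
    do.call(rbind, strsplit(unname(m[, j]), sep, fixed = TRUE))
  })
  out <- do.call(cbind, pieces)
  # Keep the row names and repeat each column name twice
  dimnames(out) <- list(rownames(m), rep(colnames(m), each = 2))
  out
}

res <- split_cols(m)
res
```

Because the result stays a matrix, the duplicated column names requested in the question are kept as-is; a data.frame would normally make them unique.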


A5C*_*2T1 6

I'm biased, but I'd suggest using cSplit from my "splitstackshape" package. Since your input appears to have rownames, use as.data.table(., keep.rownames = TRUE):

library(splitstackshape)
cSplit(as.data.table(mydf, keep.rownames = TRUE), names(mydf), "_")
#     rn X9178_1 X9178_2 X3574_1 X3574_2 X3547_1 X3547_2
# 1: 160       B       B       B       B       A       A
# 2: 301       B       B       A       B       A       B
# 3: 303       B       B       B       B       A       A
# 4: 311       A       B       A       B       A       A
# 5: 312       B       B       A       B       A       A
# 6: 314       B       B       A       B       A       A

Less legible than cSplit (but at present probably faster) would be to use stri_split_fixed from "stringi", like this:

library(stringi)
`dimnames<-`(do.call(cbind, 
                     lapply(mydf, stri_split_fixed, "_", simplify = TRUE)), 
             list(rownames(mydf), rep(colnames(mydf), each = 2)))
#     X9178 X9178 X3574 X3574 X3547 X3547
# 160 "B"   "B"   "B"   "B"   "A"   "A"  
# 301 "B"   "B"   "A"   "B"   "A"   "B"  
# 303 "B"   "B"   "B"   "B"   "A"   "A"  
# 311 "A"   "B"   "A"   "B"   "A"   "A"  
# 312 "B"   "B"   "A"   "B"   "A"   "A"  
# 314 "B"   "B"   "A"   "B"   "A"   "A" 

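If speed is of the essence, I'd suggest looking at the "iotools" package, in particular the mstrsplit function. The approach is similar to the "stringi" approach: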

library(iotools)
`dimnames<-`(do.call(cbind, 
                lapply(mydf, mstrsplit, "_", ncol = 2, type = "character")),
             list(rownames(mydf), rep(colnames(mydf), each = 2)))

You might need to add an lapply(mydf, as.character) in there if you forgot to use stringsAsFactors = FALSE when converting from a matrix to a data.frame, but it should still beat even the stri_split approach.
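
A minimal sketch of that conversion step, assuming the data.frame was built from a character matrix with stringsAsFactors = TRUE (the small matrix m below is a made-up sample, not the full data):

```r
# Hypothetical sample matrix in the same layout as the question
m <- matrix(c("B_B", "B_B", "A_B",
              "B_B", "A_B", "A_B"),
            nrow = 3,
            dimnames = list(c("160", "301", "311"), c("9178", "3574")))

# Converting with stringsAsFactors = TRUE turns every column into a factor
mydf <- data.frame(m, stringsAsFactors = TRUE)
was_factor <- vapply(mydf, is.factor, logical(1))

# Convert every column back to character; `mydf[] <-` keeps the
# data.frame structure, including row and column names
mydf[] <- lapply(mydf, as.character)
is_char <- vapply(mydf, is.character, logical(1))
```

After this step the columns are plain character vectors again, which is what the string-splitting calls above operate on.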