strsplit to data.frame with incomplete input

Question

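I am trying to split a character vector into a data.frame object. For a fixed field order this isn't a problem (e.g. as written here), but in my particular case the columns of the future data frame are not all present in every string. This is the desired output for a toy input: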

input <- c("an=1;bn=3;cn=45",
           "bn=3.5;cn=76",
           "an=2;dn=5")

res <- do.something(input)

> res
      an  bn  cn  dn
[1,]  1   3   45  NA
[2,]  NA  3.5 76  NA
[3,]  2   NA  NA  5

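I am now looking for a function do.something that achieves this efficiently. My naive solution at the moment is to loop over the input elements, strsplit them on ";", strsplit the pieces again on "=", and then fill the data.frame result by result. Is there a more R-like way to do this? I'm worried that going element by element will take a long time for a long input vector.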

EDIT: For completeness, my naive solution looks like this:

  do.something <- function(x){
    temp <- strsplit(x,";")
    temp2 <- sapply(temp,strsplit,"=")
    ul.temp2 <- unlist(temp2)
    label <- sort(unique(ul.temp2[seq(1,length(ul.temp2),2)]))
    res <- data.frame(matrix(NA, nrow = length(x), ncol = length(label)))
    colnames(res) <- label
    for(i in 1:length(temp)){
      for(j in 1:length(label)){
        curInfo <- unlist(temp2[[i]])
        if(sum(is.element(curInfo,label[j]))>0){
          res[i,j] <- curInfo[which(curInfo==label[j])+1]
        }
      }
    }
    res
  }

EDIT 2: Unfortunately, my large input data looks like this (it contains entries without '='):

input <- c("an=1;bn=3;cn=45",
           "an;bn=3.5;cn=76",
           "an=2;dn=5")

So I cannot compare the given answers on the problem I actually have at hand. My naive solution is now:

do.something <- function(x){
    temp <- strsplit(x,";")
    tempNames <- sort(unique(sapply(strsplit(unlist(temp),"="),"[",1)))
    res <- data.frame(matrix(NA, nrow = length(x), ncol = length(tempNames)))
    colnames(res) <- tempNames

    for(i in 1:length(temp)){
      curSplit <- strsplit(unlist(temp[[i]]),"=")
      curNames <- sapply(curSplit,"[",1)
      curValues <- sapply(curSplit,"[",2)
      for(j in 1:length(tempNames)){
        if(is.element(colnames(res)[j],curNames)){
          res[i,j] <- curValues[curNames==colnames(res)[j]]
        }
      }
    }
    res
  }

Answer 1

Sim*_*lon 4

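Here is another approach that should work even given your edited data. Extract the column names and the values from the input vector using regmatches, then loop over each list element, matching the values to the appropriate column names.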

#  Get column names
tag <- regmatches( input , gregexpr( "[a-z]+" , input ) )

#  Get numbers including floating point (any number of decimals), replace missing values with NA
val <- regmatches( input , gregexpr( "\\d+\\.?\\d*|(?<=[a-z]);" , input , perl = TRUE ) )
val <- lapply( val , gsub , pattern = ";" , replacement = NA )

#  Column names
nms <- unique( unlist(tag) )

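#  Intermediate matrices (SIMPLIFY = FALSE keeps a list even when every element has the same number of fields)
ll <- mapply( cbind , val , tag , SIMPLIFY = FALSE )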

#  Match to appropriate columns and coerce to data.frame
out <- data.frame( do.call( rbind , lapply( ll , function(x) x[ match( nms , x[,2] ) ]  ) ) )
names(out) <- nms
#    an   bn   cn   dn
#1    1    3   45 <NA>
#2 <NA>  3.5   76 <NA>
#3    2 <NA> <NA>    5
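
The same idea of matching values to column names can also be written without a per-element loop, using matrix indexing. The following is only a minimal sketch under stated assumptions: the helper name parse_pairs is hypothetical (it does not come from the thread), every value is assumed to be numeric, and a field without '=' is treated as a missing value.

```r
# Hypothetical helper (name is not from the original thread).
# Assumptions: fields are separated by ";", key and value by the first "=",
# all values are numeric, and a field without "=" means "missing".
parse_pairs <- function(x) {
  parts  <- strsplit(x, ";", fixed = TRUE)            # one character vector per input string
  row_id <- rep(seq_along(parts), lengths(parts))     # row index of every field
  flat   <- unlist(parts)                             # all fields in one vector
  key    <- sub("=.*$", "", flat)                     # text before the first "="
  # Text after the first "=", or NA when the field has no "="
  value  <- ifelse(grepl("=", flat, fixed = TRUE),
                   sub("^[^=]*=", "", flat), NA)
  cols   <- sort(unique(key))
  res    <- matrix(NA_real_, nrow = length(x), ncol = length(cols),
                   dimnames = list(NULL, cols))
  # Fill all cells at once: one (row, column) index pair per field
  res[cbind(row_id, match(key, cols))] <- as.numeric(value)
  as.data.frame(res)
}

input <- c("an=1;bn=3;cn=45",
           "an;bn=3.5;cn=76",
           "an=2;dn=5")
res <- parse_pairs(input)
```

For the edited toy input, res has the columns an, bn, cn and dn as numeric vectors, with NA wherever a key is absent or has no value. If a key occurs twice in one string, the last occurrence wins in this sketch.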
