转换数组,其中元素具有不同数量的逗号分隔数据到data.frame

use*_*158 2 performance r

我有一个冗长的数组,其中每个元素都有逗号分隔的数据.每个元素中的第一条数据是标识符ID.其余数据由两组数据组成.我们将它们称为X和Z.这是一个例子.

我的数组看起来像这样:

ABC,1,1.5,2,2.4,3,3.1
DEF,1,1.7,2,0.9
GHI,3,8.2
JKL,1,1.5,2,2.4,3,3.13,8.2
Run Code Online (Sandbox Code Playgroud)

我希望它将它转换为如下所示的数据框:

    ID X   Z
1  ABC 1 1.5
2  ABC 2 2.4
1  ABC 1 1.5
2  ABC 2 2.4
3  ABC 3 3.1
4  DEF 1 1.7
5  DEF 2 0.9
6  GHI 3 8.2
7  JKL 1 1.5
8  JKL 2 2.4
9  JKL 3 3.1
10 JKL 3 8.2
Run Code Online (Sandbox Code Playgroud)

我写了一个循环(粘贴在下面),它将为我做这个,但它是非常低效的.我的数组接近1,000,000个元素,并且循环正在进行中.任何的意见都将会有帮助.我仍然熟悉应用函数,但我不知道在这种情况下如何写一个.

在此代码中,具有原始数据的数组称为"行".

#write a function to test if an itteration is even
is.even <- function(x) x %% 2 == 0 

#create the dataframe, empty so I can add to it;
dfA <- data.frame()
itter <- 0

#start loop over lines of the array;
for (i in 1:length(lines)){
  #for (i in 1:10){
  itter <- itter + 1
  print(itter)
  line_data <- strsplit(lines[i], ",")
  #start loop over list containing data elements in each line
  for (j in 1:length(line_data[[1]])){

    filename = line_data[[1]][1]
    if (j>1){
      if (is.even(j)){
        X = as.numeric(line_data[[1]][j])
        Z = as.numeric(line_data[[1]][j+1])

        if (!exists("dfA")){
          dfA = data.frame("ID" = filename, "X" = X, "Z"=Z)
        }
        else{
          df_temp = data.frame("ID" = filename, "X" = X, "Z"=Z)
          dfA = rbind(dfA, df_temp)
        }
      }
    }
  }
}
Run Code Online (Sandbox Code Playgroud)

A5C*_*2T1 5

我建议编写如下函数:

tableMaker <- function(invec) {
  ## http://stackoverflow.com/q/30528592/1270695
  require(data.table)
  ## Split up the vector
  temp <- strsplit(invec, ",", TRUE)
  ## How long is each vector?
  a <- lengths(temp)
  ## Which vectors need adjustment?
  ind <- which(a %% 2 == 0)
  ## Adjust only those that need adjustment
  temp[ind] <- lapply(temp[ind], function(x) {
    c(x[1:(length(x)-1)], x[length(x)-2], x[length(x)])
  })
  ## Recalculate lengths
  a <- lengths(temp)
  ## Figure out where the IDs are
  a2 <- c(1, cumsum(a[-length(a)]) + 1)
  ## Unlist the data
  tempUL <- unlist(temp)
  ## Grab the IDs and repeat them to the necessary length
  ID <- rep(tempUL[a2], a/2)
  ## Make a 2 column matrix from the remaining values
  MAT <- matrix(tempUL[-a2], ncol = 2, byrow = TRUE, 
                dimnames = list(NULL, c("X", "Z")))
  ## Combine it into a data.table and run type.convert
  data.table(ID, MAT)[, lapply(.SD, type.convert)]
}
Run Code Online (Sandbox Code Playgroud)

用法将是:

tableMaker(dat)
#      ID X    Z
#  1: ABC 1 1.50
#  2: ABC 2 2.40
#  3: ABC 3 3.10
#  4: DEF 1 1.70
#  5: DEF 2 0.90
#  6: GHI 3 8.20
#  7: JKL 1 1.50
#  8: JKL 2 2.40
#  9: JKL 3 3.13
# 10: JKL 3 8.20
Run Code Online (Sandbox Code Playgroud)

在具有1,000,000个值的数据集上,这将在几秒钟内运行:

dat2 <- rep(dat, 1000000/length(dat))

system.time(out <- tableMaker(dat2))
#    user  system elapsed 
#   4.284   0.000   3.053 
out
#           ID X    Z
#       1: ABC 1 1.50
#       2: ABC 2 2.40
#       3: ABC 3 3.10
#       4: DEF 1 1.70
#       5: DEF 2 0.90
# ---           
# 2499996: GHI 3 8.20
# 2499997: JKL 1 1.50
# 2499998: JKL 2 2.40
# 2499999: JKL 3 3.13
# 2500000: JKL 3 8.20
Run Code Online (Sandbox Code Playgroud)

我没有测试@Roland对1M值的处理方法,但这里是对1000个值的比较:

library(microbenchmark)
dat1000 <- rep(dat, 1000/length(dat))
microbenchmark(tableMaker(dat1000), roland(dat1000), times = 10)
# Unit: milliseconds
#                 expr        min        lq       mean     median         uq        max neval
#  tableMaker(dat1000)   2.346186   2.53734   2.647924   2.573726   2.730987   3.048823    10
#      roland(dat1000) 382.857587 391.46409 406.925600 402.442203 412.203468 452.420665    10
Run Code Online (Sandbox Code Playgroud)

这里有10k值:

dat10k <- rep(dat, 10000/length(dat))
microbenchmark(tableMaker(dat10k), roland(dat10k), times = 5)
# Unit: milliseconds
#                expr        min         lq       mean     median         uq        max neval
#  tableMaker(dat10k)   19.24391   22.51366   24.57222   23.43996   27.11431   30.54927     5
#      roland(dat10k) 6286.45480 6324.42184 6497.16173 6325.32259 6355.39668 7194.21274     5
Run Code Online (Sandbox Code Playgroud)