我有一个冗长的数组,其中每个元素都有逗号分隔的数据.每个元素中的第一条数据是标识符ID.其余数据由两组数据组成.我们将它们称为X和Z.这是一个例子.
我的数组看起来像这样:
ABC,1,1.5,2,2.4,3,3.1
DEF,1,1.7,2,0.9
GHI,3,8.2
JKL,1,1.5,2,2.4,3,3.13,8.2
Run Code Online (Sandbox Code Playgroud)
我希望它将它转换为如下所示的数据框:
ID X Z
1 ABC 1 1.5
2 ABC 2 2.4
1 ABC 1 1.5
2 ABC 2 2.4
3 ABC 3 3.1
4 DEF 1 1.7
5 DEF 2 0.9
6 GHI 3 8.2
7 JKL 1 1.5
8 JKL 2 2.4
9 JKL 3 3.1
10 JKL 3 8.2
Run Code Online (Sandbox Code Playgroud)
我写了一个循环(粘贴在下面),它将为我做这个,但它是非常低效的.我的数组接近1,000,000个元素,并且循环正在进行中.任何的意见都将会有帮助.我仍然熟悉应用函数,但我不知道在这种情况下如何写一个.
在此代码中,具有原始数据的数组称为"行".
#write a function to test if an itteration is even
is.even <- function(x) x %% 2 == 0
#create the dataframe, empty so I can add to it;
dfA <- data.frame()
itter <- 0
#start loop over lines of the array;
for (i in 1:length(lines)){
#for (i in 1:10){
itter <- itter + 1
print(itter)
line_data <- strsplit(lines[i], ",")
#start loop over list containing data elements in each line
for (j in 1:length(line_data[[1]])){
filename = line_data[[1]][1]
if (j>1){
if (is.even(j)){
X = as.numeric(line_data[[1]][j])
Z = as.numeric(line_data[[1]][j+1])
if (!exists("dfA")){
dfA = data.frame("ID" = filename, "X" = X, "Z"=Z)
}
else{
df_temp = data.frame("ID" = filename, "X" = X, "Z"=Z)
dfA = rbind(dfA, df_temp)
}
}
}
}
}
Run Code Online (Sandbox Code Playgroud)
我建议编写如下函数:
tableMaker <- function(invec) {
## http://stackoverflow.com/q/30528592/1270695
require(data.table)
## Split up the vector
temp <- strsplit(invec, ",", TRUE)
## How long is each vector?
a <- lengths(temp)
## Which vectors need adjustment?
ind <- which(a %% 2 == 0)
## Adjust only those that need adjustment
temp[ind] <- lapply(temp[ind], function(x) {
c(x[1:(length(x)-1)], x[length(x)-2], x[length(x)])
})
## Recalculate lengths
a <- lengths(temp)
## Figure out where the IDs are
a2 <- c(1, cumsum(a[-length(a)]) + 1)
## Unlist the data
tempUL <- unlist(temp)
## Grab the IDs and repeat them to the necessary length
ID <- rep(tempUL[a2], a/2)
## Make a 2 column matrix from the remaining values
MAT <- matrix(tempUL[-a2], ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("X", "Z")))
## Combine it into a data.table and run type.convert
data.table(ID, MAT)[, lapply(.SD, type.convert)]
}
Run Code Online (Sandbox Code Playgroud)
用法将是:
tableMaker(dat)
# ID X Z
# 1: ABC 1 1.50
# 2: ABC 2 2.40
# 3: ABC 3 3.10
# 4: DEF 1 1.70
# 5: DEF 2 0.90
# 6: GHI 3 8.20
# 7: JKL 1 1.50
# 8: JKL 2 2.40
# 9: JKL 3 3.13
# 10: JKL 3 8.20
Run Code Online (Sandbox Code Playgroud)
在具有1,000,000个值的数据集上,这将在几秒钟内运行:
dat2 <- rep(dat, 1000000/length(dat))
system.time(out <- tableMaker(dat2))
# user system elapsed
# 4.284 0.000 3.053
out
# ID X Z
# 1: ABC 1 1.50
# 2: ABC 2 2.40
# 3: ABC 3 3.10
# 4: DEF 1 1.70
# 5: DEF 2 0.90
# ---
# 2499996: GHI 3 8.20
# 2499997: JKL 1 1.50
# 2499998: JKL 2 2.40
# 2499999: JKL 3 3.13
# 2500000: JKL 3 8.20
Run Code Online (Sandbox Code Playgroud)
我没有测试@Roland对1M值的处理方法,但这里是对1000个值的比较:
library(microbenchmark)
dat1000 <- rep(dat, 1000/length(dat))
microbenchmark(tableMaker(dat1000), roland(dat1000), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# tableMaker(dat1000) 2.346186 2.53734 2.647924 2.573726 2.730987 3.048823 10
# roland(dat1000) 382.857587 391.46409 406.925600 402.442203 412.203468 452.420665 10
Run Code Online (Sandbox Code Playgroud)
这里有10k值:
dat10k <- rep(dat, 10000/length(dat))
microbenchmark(tableMaker(dat10k), roland(dat10k), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# tableMaker(dat10k) 19.24391 22.51366 24.57222 23.43996 27.11431 30.54927 5
# roland(dat10k) 6286.45480 6324.42184 6497.16173 6325.32259 6355.39668 7194.21274 5
Run Code Online (Sandbox Code Playgroud)