ATM*_*hew 20 split r dataframe
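I have the following data frame, and I would like to break it up into 10 different data frames. That is, I want to split the initial 100-row data frame into 10 data frames of 10 rows each. I could do the following and get the desired result.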
df = data.frame(one=c(rnorm(100)), two=c(rnorm(100)), three=c(rnorm(100)))
df1 = df[1:10,]
df2 = df[11:20,]
df3 = df[21:30,]
df4 = df[31:40,]
df5 = df[41:50,]
...
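Of course, this is not an elegant way to perform the task when the initial data frame is larger, or when it cannot be divided into a convenient number of equal segments.
So, given the above, suppose we have the following data frame.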
df = data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))
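Now I want to split it into new data frames of 200 rows each, plus a final data frame containing the remaining rows. What is a more elegant (aka "quick") way to perform this task?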
42-*_*42- 30
> str(split(df, (as.numeric(rownames(df))-1) %/% 200))
List of 6
$ 0:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.592 1.664 -1.231 0.269 0.912 ...
..$ two : num [1:200] 0.639 -0.525 0.642 1.347 1.142 ...
..$ three: num [1:200] -0.45 -0.877 0.588 1.188 -1.977 ...
$ 1:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -0.0017 1.9534 0.0155 -0.7732 -1.1752 ...
..$ two : num [1:200] -0.422 0.869 0.45 -0.111 0.073 ...
..$ three: num [1:200] -0.2809 1.31908 0.26695 0.00594 -0.25583 ...
$ 2:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.578 0.433 0.277 1.297 0.838 ...
..$ two : num [1:200] 0.913 0.378 0.35 -0.241 0.783 ...
..$ three: num [1:200] -0.8402 -0.2708 -0.0124 -0.4537 0.4651 ...
$ 3:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] 1.432 1.657 -0.72 -1.691 0.596 ...
..$ two : num [1:200] 0.243 -0.159 -2.163 -1.183 0.632 ...
..$ three: num [1:200] 0.359 0.476 1.485 0.39 -1.412 ...
$ 4:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.43 -0.345 -1.206 -0.925 -0.551 ...
..$ two : num [1:200] -1.343 1.322 0.208 0.444 -0.861 ...
..$ three: num [1:200] 0.00807 -0.20209 -0.56865 1.06983 -0.29673 ...
$ 5:'data.frame': 123 obs. of 3 variables:
..$ one : num [1:123] -1.269 1.555 -0.19 1.434 -0.889 ...
..$ two : num [1:123] 0.558 0.0445 -0.0639 -1.934 -0.8152 ...
..$ three: num [1:123] -0.0821 0.6745 0.6095 1.387 -0.382 ...
If some code may have changed the rownames, it is safer to use:
split(df, (seq(nrow(df))-1) %/% 200)
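To see why the integer-division trick works, here is a minimal sketch on a tiny 7-row data frame with a chunk size of 3. The names `small`, `chunk_size`, `group_id` and `pieces` are made up for illustration and do not come from the answer.

```r
# Illustrative only: `small` and `chunk_size` are hypothetical names.
small <- data.frame(id = 1:7, value = letters[1:7])
chunk_size <- 3

# Zero-based row positions divided (integer division) by the chunk size
# give one group id per row.
group_id <- (seq_len(nrow(small)) - 1) %/% chunk_size
print(group_id)  # 0 0 0 1 1 1 2

# split() returns one data frame per group id; the last one holds the remainder.
pieces <- split(small, group_id)
print(sapply(pieces, nrow))
```

`seq_len()` is used here instead of `seq()` so that a zero-row data frame yields no group ids at all, whereas `seq(0)` would return `1 0`.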
小智 6
require(ff)
df <- data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))
for(i in chunk(from = 1, to = nrow(df), by = 200)){
print(df[min(i):max(i), ])
}
If you can generate a vector that defines the groups, you can use split:
f <- rep(seq_len(ceiling(1123 / 200)),each = 200,length.out = 1123)
> df1 <- split(df,f = f)
> lapply(df1,dim)
$`1`
[1] 200 3
$`2`
[1] 200 3
$`3`
[1] 200 3
$`4`
[1] 200 3
$`5`
[1] 200 3
$`6`
[1] 123 3
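The grouping vector above hard-codes 1123 and 200. As a rough sketch, the same idea could be wrapped in a small helper that reads the row count from the data frame. `split_by_rows` is a hypothetical name introduced here, not something from the answer or from any package.

```r
# Hypothetical helper, not part of the original answer.
split_by_rows <- function(data, size) {
  n <- nrow(data)
  # Group ids 1, 2, ... each repeated `size` times, truncated to n entries.
  groups <- rep(seq_len(ceiling(n / size)), each = size, length.out = n)
  split(data, groups)
}

df <- data.frame(one = rnorm(1123), two = rnorm(1123), three = rnorm(1123))
parts <- split_by_rows(df, 200)

# Five pieces of 200 rows and a final piece with the remaining 123 rows.
print(sapply(parts, nrow))
```

Individual pieces can then be reached by name or position, for example `parts[["6"]]` or `parts[[6]]` for the remainder.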
library(DBI) # provides dbWriteTable(); `con` is assumed to be an already open DBI connection
batchsize = 1000000 # vary to your liking
# cycles through data by batchsize
for (i in 1:ceiling(nrow(df)/batchsize))
{
print(i) # just to show the progress
# below shows how to cycle through data
batch <- df[(((i-1)*batchsize)+1):(batchsize*i),,drop=FALSE] # drop = FALSE keeps it from being converted to a vector
# without the next line, the last batch is padded with NA rows past the end of the actual data
batch <- batch[!is.na(batch$ID),] # ID is a variable I presume is in every row
#in this case the table already existed, if new table overwrite = TRUE
(dbWriteTable(con, "df", batch, append = TRUE,row.names = FALSE))
}
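An alternative to filtering out the NA rows afterwards is to cap the upper bound of each batch at `nrow(df)`. The following is only a sketch of that variation with made-up data and no database write; `starts` and `batch_rows` are names introduced here for illustration.

```r
# Illustrative data; the per-batch work (e.g. a database write) is left out.
df <- data.frame(ID = 1:25, value = rnorm(25))
batchsize <- 10

# First row of each batch: 1, 11, 21.
starts <- seq(1, nrow(df), by = batchsize)
batch_rows <- integer(0)

for (start in starts) {
  # Capping the end at nrow(df) means the last batch never indexes past the data.
  end <- min(start + batchsize - 1, nrow(df))
  batch <- df[start:end, , drop = FALSE]
  batch_rows <- c(batch_rows, nrow(batch))
}

print(batch_rows)  # 10 10 5
```

This avoids depending on a column such as `ID` being present and never NA in every real row.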