Splitting a large data frame into smaller segments

Question


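I have the following data frame and I want to break it up into 10 different data frames. That is, I want to split the original 100-row data frame into 10 data frames of 10 rows each. I can do the following and get the desired result.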

df = data.frame(one=c(rnorm(100)), two=c(rnorm(100)), three=c(rnorm(100)))

df1 = df[1:10,]
df2 = df[11:20,]
df3 = df[21:30,]
df4 = df[31:40,]
df5 = df[41:50,]
...


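Of course, this is not an elegant way to perform the task when the initial data frame is larger, or when its row count does not divide evenly into segments.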

So, given the above, suppose we have the following data frame.

df = data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))


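Now I want to split it into new data frames of 200 rows each, plus a final data frame containing the remaining rows. What is a more elegant (that is, "quick") way to perform this task?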

Answer 1

42-*_*42- 30

 > str(split(df, (as.numeric(rownames(df))-1) %/% 200))
List of 6
 $ 0:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -1.592 1.664 -1.231 0.269 0.912 ...
  ..$ two  : num [1:200] 0.639 -0.525 0.642 1.347 1.142 ...
  ..$ three: num [1:200] -0.45 -0.877 0.588 1.188 -1.977 ...
 $ 1:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -0.0017 1.9534 0.0155 -0.7732 -1.1752 ...
  ..$ two  : num [1:200] -0.422 0.869 0.45 -0.111 0.073 ...
  ..$ three: num [1:200] -0.2809 1.31908 0.26695 0.00594 -0.25583 ...
 $ 2:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -1.578 0.433 0.277 1.297 0.838 ...
  ..$ two  : num [1:200] 0.913 0.378 0.35 -0.241 0.783 ...
  ..$ three: num [1:200] -0.8402 -0.2708 -0.0124 -0.4537 0.4651 ...
 $ 3:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] 1.432 1.657 -0.72 -1.691 0.596 ...
  ..$ two  : num [1:200] 0.243 -0.159 -2.163 -1.183 0.632 ...
  ..$ three: num [1:200] 0.359 0.476 1.485 0.39 -1.412 ...
 $ 4:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -1.43 -0.345 -1.206 -0.925 -0.551 ...
  ..$ two  : num [1:200] -1.343 1.322 0.208 0.444 -0.861 ...
  ..$ three: num [1:200] 0.00807 -0.20209 -0.56865 1.06983 -0.29673 ...
 $ 5:'data.frame':  123 obs. of  3 variables:
  ..$ one  : num [1:123] -1.269 1.555 -0.19 1.434 -0.889 ...
  ..$ two  : num [1:123] 0.558 0.0445 -0.0639 -1.934 -0.8152 ...
  ..$ three: num [1:123] -0.0821 0.6745 0.6095 1.387 -0.382 ...


If some code might have changed the rownames, it is safer to use:

 split(df, (seq(nrow(df))-1) %/% 200)
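
As a minimal sketch, the same idea can be wrapped in a small helper. The name `split_rows` and its `size` argument are hypothetical, chosen here for illustration; only `split()`, `seq_len()`, and `%/%` come from base R.

```r
# Hypothetical helper (not part of the original answer): split a data frame
# into consecutive chunks of at most `size` rows, ignoring its row names.
split_rows <- function(df, size) {
  # Row positions 1..n map to group ids 0, 0, ..., 1, 1, ... via integer division
  groups <- (seq_len(nrow(df)) - 1) %/% size
  split(df, groups)
}

df <- data.frame(one = rnorm(1123), two = rnorm(1123), three = rnorm(1123))
parts <- split_rows(df, 200)

length(parts)        # number of chunks
sapply(parts, nrow)  # rows per chunk; the last chunk holds the remainder
```

The sketch uses `seq_len(nrow(df))` rather than `seq(nrow(df))` because `seq(0)` evaluates to `c(1, 0)`, which would misbehave on a zero-row data frame.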


Answer 2

小智 6

require(ff)
df <- data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))
for(i in chunk(from = 1, to = nrow(df), by = 200)){
  print(df[min(i):max(i), ])
}
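
If the ff package is not available, a similar chunked loop can be sketched in base R. This is an alternative, not the answer's own method, and it assumes that only the first and last row of each chunk are needed. The names `chunk_size`, `starts`, `piece`, and `sizes` are illustrative.

```r
df <- data.frame(one = rnorm(1123), two = rnorm(1123), three = rnorm(1123))
chunk_size <- 200

# First row of every chunk: 1, 201, 401, ...
starts <- seq(1, nrow(df), by = chunk_size)
sizes <- integer(0)

for (s in starts) {
  # Clamp the end so the final chunk stops at the last row
  e <- min(s + chunk_size - 1, nrow(df))
  piece <- df[s:e, , drop = FALSE]
  sizes <- c(sizes, nrow(piece))  # collect sizes instead of printing every chunk
}

sizes
```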


Answer 3

jor*_*ran 5

If you can generate a vector that defines the groups, you can use split:

> f <- rep(seq_len(ceiling(1123 / 200)),each = 200,length.out = 1123)
> df1 <- split(df,f = f)
> lapply(df1,dim)
$`1`
[1] 200   3

$`2`
[1] 200   3

$`3`
[1] 200   3

$`4`
[1] 200   3

$`5`
[1] 200   3

$`6`
[1] 123   3
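
A small sketch of the same grouping vector, built from `nrow(df)` instead of the hard-coded 1123, followed by one way the resulting list might be used. The names `size`, `n`, and `parts` are illustrative, and the per-chunk mean is only an example of a summary.

```r
df <- data.frame(one = rnorm(1123), two = rnorm(1123), three = rnorm(1123))
size <- 200
n <- nrow(df)

# Group labels 1, 1, ..., 2, 2, ... truncated to exactly n entries
f <- rep(seq_len(ceiling(n / size)), each = size, length.out = n)
parts <- split(df, f = f)

table(f)                                # rows assigned to each group
sapply(parts, function(p) mean(p$one))  # one summary value per chunk
```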


Answer 4

Sam*_*Sam 5

Split df into groups of 1 million rows and push them to SQL, appending 1 million rows at a time to the table df.

library(DBI)  # provides dbWriteTable(); `con` is assumed to be an open DBI connection

batchsize <- 1000000  # vary to your liking

# cycle through the data in batches of `batchsize` rows
for (i in 1:ceiling(nrow(df) / batchsize)) {
  print(i)  # just to show the progress

  # rows for batch i; drop = FALSE keeps the result a data frame
  batch <- df[(((i - 1) * batchsize) + 1):(batchsize * i), , drop = FALSE]

  # without this step the last batch contains NA rows past the end of the actual data
  batch <- batch[!is.na(batch$ID), ]  # ID is a variable I presume is in every row

  # in this case the table already existed; for a new table use overwrite = TRUE
  dbWriteTable(con, "df", batch, append = TRUE, row.names = FALSE)
}
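
The NA filtering step is needed because indexing past the last row of a data frame yields rows filled with NA. A sketch with a hypothetical 10-row data frame and a batch size of 4 shows the effect, along with one possible alternative that clamps the upper index instead of filtering afterwards. No database is involved here; the `ID` and `value` columns and the sizes are made up for illustration.

```r
df <- data.frame(ID = 1:10, value = rnorm(10))  # hypothetical data
batchsize <- 4
i <- 3  # third and final batch: rows 9..12 requested, only 9..10 exist

# Unclamped range: the two non-existent rows come back as NA
batch_raw <- df[(((i - 1) * batchsize) + 1):(batchsize * i), , drop = FALSE]
nrow(batch_raw)           # 4 rows, two of them all NA
sum(is.na(batch_raw$ID))

# Clamped range: stop at nrow(df), so no NA rows are created
last <- min(batchsize * i, nrow(df))
batch <- df[(((i - 1) * batchsize) + 1):last, , drop = FALSE]
nrow(batch)
```

Clamping also avoids relying on a column such as `ID` being free of genuine missing values.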

