Speeding up the performance of write.table

lol*_*ity 42 r

I have a data.frame that I want to write out. The dimensions of my data.frame are 256 rows by 65536 columns. What are faster alternatives to write.csv?

Mic*_*ico 65

data.table::fwrite(), contributed by Otto Seiskari, is available in versions 1.9.8+. Matt has made additional enhancements on top (including parallelization) and wrote an article about it. Please report any issues on the tracker.
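
As a minimal sketch of the basic usage (the file name here is illustrative):

```r
library(data.table)

# build a small table and write it out as CSV; fwrite() chooses
# sensible defaults for the separator and quoting
DT <- data.table(id = 1:3, value = c("a", "b", "c"))
fwrite(DT, "out.csv")
```

In most cases this is a drop-in replacement for write.csv(DT, "out.csv", row.names = FALSE).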

First, here's a comparison on the same dimensions used by @chase above (i.e., a very large number of columns: 65,000 columns (!) by 256 rows), together with fwrite and write_feather, so that we have some consistency across machines. Note the huge difference that compress=FALSE makes in base R.

# -----------------------------------------------------------------------------
# function  | object type |  output type | compress= | Runtime | File size |
# -----------------------------------------------------------------------------
# save      |      matrix |    binary    |   FALSE   |    0.3s |    134MB  |
# save      |  data.frame |    binary    |   FALSE   |    0.4s |    135MB  |
# feather   |  data.frame |    binary    |   FALSE   |    0.4s |    139MB  |
# fwrite    |  data.table |    csv       |   FALSE   |    1.0s |    302MB  |
# save      |      matrix |    binary    |   TRUE    |   17.9s |     89MB  |
# save      |  data.frame |    binary    |   TRUE    |   18.1s |     89MB  |
# write.csv |      matrix |    csv       |   FALSE   |   21.7s |    302MB  |
# write.csv |  data.frame |    csv       |   FALSE   |  121.3s |    302MB  |

Note that fwrite() runs in parallel. The timings shown here are on a 13" MacBook Pro with 2 cores and 1 thread/core (+2 virtual threads via hyperthreading), 512GB SSD, 256KB/core L2 cache, and 4MB L4 cache. Depending on your system specs, YMMV.
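
In recent data.table versions, the degree of parallelism can be controlled either globally with setDTthreads() or per call via fwrite()'s nThread argument; a small sketch (the value 2 is just an example):

```r
library(data.table)

# cap the number of threads data.table uses globally
# (clamped to the number of logical cores available)
setDTthreads(2)

# or override it for a single write via nThread
DT <- data.table(x = 1:10, y = letters[1:10])
fwrite(DT, "threads.csv", nThread = 2)
```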

I also re-ran the benchmark on more realistic (and larger) data:

library(data.table)
NN <- 5e6 # at this number of rows, the .csv output is ~800Mb on my machine
set.seed(51423)
DT <- data.table(
  str1 = sample(sprintf("%010d",1:NN)), #ID field 1
  str2 = sample(sprintf("%09d",1:NN)),  #ID field 2
  # varying length string field--think names/addresses, etc.
  str3 = replicate(NN,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
  # factor-like string field with 50 "levels"
  str4 = sprintf("%05d",sample(sample(1e5,50),NN,T)),
  # factor-like string field with 17 levels, varying length
  str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T),
      collapse="")),NN,T),
  # lognormally distributed numeric
  num1 = round(exp(rnorm(NN,mean=6.5,sd=1.5)),2),
  # 3 binary strings
  str6 = sample(c("Y","N"),NN,T),
  str7 = sample(c("M","F"),NN,T),
  str8 = sample(c("B","W"),NN,T),
  # right-skewed (integer type)
  int1 = as.integer(ceiling(rexp(NN))),
  num2 = round(exp(rnorm(NN,mean=6,sd=1.5)),2),
  # lognormal numeric that can be positive or negative
  num3 = (-1)^sample(2,NN,T)*round(exp(rnorm(NN,mean=6,sd=1.5)),2))

# -------------------------------------------------------------------------------
# function  |   object   | out |        other args         | Runtime  | File size |
# -------------------------------------------------------------------------------
# fwrite    | data.table | csv |      quote = FALSE        |   1.7s   |  523.2MB  |
# fwrite    | data.frame | csv |      quote = FALSE        |   1.7s   |  523.2MB  |
# feather   | data.frame | bin |     no compression        |   3.3s   |  635.3MB  |
# save      | data.frame | bin |     compress = FALSE      |  12.0s   |  795.3MB  |
# write.csv | data.frame | csv |    row.names = FALSE      |  28.7s   |  493.7MB  |
# save      | data.frame | bin |     compress = TRUE       |  48.1s   |  190.3MB  |
# -------------------------------------------------------------------------------

So fwrite is about 2x faster than feather in this test. This was run on the same machine as described above, with fwrite running in parallel on 2 cores.

feather also appears to be a very fast binary format, but it has no compression yet.


Here is an attempt to show how fwrite compares at scale:

NB: the benchmark has been updated to run base R's save() with compress = FALSE (since feather is also not compressed).

[plot: "Relative Speed of fwrite (turbo) vs. rest", median runtime relative to fwrite as the number of rows grows]

So fwrite is the fastest of all of these on this data (running on 2 cores), and it produces a .csv that can easily be viewed, inspected, and passed to grep, sed, etc.
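
Since the output is plain text, it also round-trips through data.table's own fread(); a small sketch (the file name is illustrative):

```r
library(data.table)

# write a CSV with fwrite(), then read it straight back with fread()
DT <- data.table(a = 1:5, b = letters[1:5])
fwrite(DT, "roundtrip.csv", quote = FALSE)
DT2 <- fread("roundtrip.csv")
# column types survive the round trip (integer and character here)
```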

Code to reproduce:

require(data.table)
require(microbenchmark)
require(feather)
ns <- as.integer(10^seq(2, 6, length.out = 25))
DTn <- function(nn)
    data.table(
          str1 = sample(sprintf("%010d",1:nn)),
          str2 = sample(sprintf("%09d",1:nn)),
          str3 = replicate(nn,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
          str4 = sprintf("%05d",sample(sample(1e5,50),nn,T)),
          str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T), collapse="")),nn,T),
          num1 = round(exp(rnorm(nn,mean=6.5,sd=1.5)),2),
          str6 = sample(c("Y","N"),nn,T),
          str7 = sample(c("M","F"),nn,T),
          str8 = sample(c("B","W"),nn,T),
          int1 = as.integer(ceiling(rexp(nn))),
          num2 = round(exp(rnorm(nn,mean=6,sd=1.5)),2),
          num3 = (-1)^sample(2,nn,T)*round(exp(rnorm(nn,mean=6,sd=1.5)),2))

count <- data.table(n = ns,
                    c = c(rep(1000, 12),
                          rep(100, 6),
                          rep(10, 7)))

mbs <- lapply(ns, function(nn){
  print(nn)
  set.seed(51423)
  DT <- DTn(nn)
  microbenchmark(times = count[n==nn,c],
               write.csv=write.csv(DT, "writecsv.csv", quote=FALSE, row.names=FALSE),
               save=save(DT, file = "save.RData", compress=FALSE),
               fwrite=fwrite(DT, "fwrite_turbo.csv", quote=FALSE, sep=","),
               feather=write_feather(DT, "feather.feather"))})

png("microbenchmark.png", height=600, width=600)
par(las=2, oma = c(1, 0, 0, 0))
matplot(ns, t(sapply(mbs, function(x) {
  y <- summary(x)[,"median"]
  y/y[3]})),
  main = "Relative Speed of fwrite (turbo) vs. rest",
  xlab = "", ylab = "Time Relative to fwrite (turbo)",
  type = "l", lty = 1, lwd = 2, 
  col = c("red", "blue", "black", "magenta"), xaxt = "n", 
  ylim=c(0,25), xlim=c(0, max(ns)))
axis(1, at = ns, labels = prettyNum(ns, ","))
mtext("# Rows", side = 1, las = 1, line = 5)
legend("right", lty = 1, lwd = 3, 
       legend = c("write.csv", "save", "feather"),
       col = c("red", "blue", "magenta"))
dev.off()

  • @DmitriySelivanov A quick test shows `write_csv` is slower than `write.csv`... (3 upvotes)
  • I also believe `save()` correctly writes/reads columns of class `Date`, while `fwrite()` and `feather()` currently don't. So a fair comparison would be against `double`, `char` and `integer` types only. (3 upvotes)
  • What about `readr::write_csv`? It would be nice to add it to the benchmark. (2 upvotes)

Cha*_*ase 25

If all of your columns are of the same class, convert to a matrix before writing out, which provides a nearly 6x speedup. Also, you can look into using write.matrix() from the MASS package, though it did not prove faster in this example. Maybe I didn't set it up properly:

#Fake data
m <- matrix(runif(256*65536), nrow = 256)
#AS a data.frame
system.time(write.csv(as.data.frame(m), "dataframe.csv"))
#----------
#   user  system elapsed 
# 319.53   13.65  333.76 

#As a matrix
system.time(write.csv(m, "matrix.csv"))
#----------
#   user  system elapsed 
#  52.43    0.88   53.59 

#Using write.matrix()
require(MASS)
system.time(write.matrix(m, "writematrix.csv"))
#----------
#   user  system elapsed 
# 113.58   59.12  172.75 

EDIT

To address the concern raised below that the results above are unfair to data.frame, here are some further results and timings showing that the overall message is still "convert your data object to a matrix if at all possible. If that is not possible, deal with it. Alternatively, reconsider why you need to write out a 200MB+ file in CSV format if timing is of the utmost importance":

#This is a data.frame
m2 <- as.data.frame(matrix(runif(256*65536), nrow = 256))
#This is still 6x slower
system.time(write.csv(m2, "dataframe.csv"))
#   user  system elapsed 
# 317.85   13.95  332.44
#This even includes the overhead in converting to as.matrix in the timing 
system.time(write.csv(as.matrix(m2), "asmatrix.csv"))
#   user  system elapsed 
#  53.67    0.92   54.67 

So, nothing really changes. To confirm this is reasonable, consider the relative time cost of as.data.frame():

m3 <- as.matrix(m2)
system.time(as.data.frame(m3))
#   user  system elapsed 
#   0.77    0.00    0.77 

So, contrary to what the comment below suggests, this is not really a big deal and does not skew the information. If you're still not convinced that using write.csv() on large data frames is a bad idea performance-wise, consult the manual under Note:

write.table can be slow for data frames with large numbers (hundreds or more) of
columns: this is inevitable as each column could be of a different class and so must be
handled separately. If they are all of the same class, consider using a matrix instead.

Finally, if you're still losing sleep over saving things faster, consider moving to a native RData object:

system.time(save(m2, file = "thisisfast.RData"))
#   user  system elapsed 
#  21.67    0.12   21.81

  • This is a somewhat unfair comparison... as.data.frame takes quite a long time. Also, the data the OP has is already in a data.frame. (3 upvotes)
  • In the final `system.time(save(...))`, adding `compress = FALSE` is much faster: 14s vs. 0.2s on my machine. (2 upvotes)

had*_*ley 12

Another option is to use the feather file format.

df <- as.data.frame(matrix(runif(256*65536), nrow = 256))

system.time(feather::write_feather(df, "df.feather"))
#>   user  system elapsed 
#>  0.237   0.355   0.617 

Feather is a binary file format designed to be efficient to read and write. It's designed to work across multiple languages: there are currently R and Python clients, and a Julia client is in the works.

For comparison, here's how long saveRDS takes:

system.time(saveRDS(df, "df.rds"))
#>   user  system elapsed 
#> 17.363   0.307  17.856

Now, this is a somewhat unfair comparison because saveRDS compresses the data by default, and here the data is incompressible since it's completely random. Turning compression off makes saveRDS significantly faster:

system.time(saveRDS(df, "df.rds", compress = FALSE))
#>   user  system elapsed 
#>  0.181   0.247   0.473     

And indeed it's now slightly faster than feather. So why use feather? Well, it's typically faster than readRDS(), and you usually write data relatively few times compared to the number of times you read it.

system.time(readRDS("df.rds"))
#>   user  system elapsed 
#>  0.198   0.090   0.287 

system.time(feather::read_feather("df.feather"))
#>   user  system elapsed 
#>  0.125   0.060   0.185 

  • `feather` is great, but unrelated to the original question, since it's a binary format... (7 upvotes)
  • Note that `saveRDS` needs `compress = FALSE`. (6 upvotes)
  • @DmitriySelivanov I just re-read the original question, and I don't see where it asks for a plain-text format. (4 upvotes)
  • See https://gist.github.com/markdanese/28b9f5412df55efceba754fee2363444 for a gist for anyone wanting to test it. FWIW, fwrite is fast for CSV but not in the same league as feather. (3 upvotes)