read.csv faster than data.table::fread

KaZ*_*yKa 2 r fread dataframe data.table

All over the web I read that I should use data.table and fread to load my data.

But when I run a benchmark, I get the following results:

Unit: milliseconds
expr       min        lq      mean    median        uq        max neval
test1  1.229782  1.280000  1.382249  1.366277  1.460483   1.580176    10
test3  1.294726  1.355139  1.765871  1.391576  1.542041   4.770357    10
test2 23.115503 23.345451 42.307979 25.492186 57.772522 125.941734    10

The code can be seen below.

library(microbenchmark)
library(magrittr)  # provides %>%

# Directory containing data.csv, stored earlier as an RDS file
loadpath <- readRDS("paths.rds")

microbenchmark(
  test1 = read.csv(paste0(loadpath, "data.csv"), header = TRUE, sep = ";", stringsAsFactors = FALSE, colClasses = "character"),
  test2 = data.table::fread(paste0(loadpath, "data.csv"), sep = ";"),
  test3 = read.csv(paste0(loadpath, "data.csv")),
  times = 10
) %>%
  print(order = "min")

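My understanding is that fread() should be faster than read.csv() because read.csv() first reads the rows into memory as character and then tries to convert them to integer and factor data types, whereas fread() simply reads everything as character.

If this is true, shouldn't test2 be faster than test3?

Can someone explain to me why I don't achieve a speed-up, or at least the same speed, with test2 as with test1? :)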
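
One quick way to check the assumption about column types is to compare the classes each reader returns. The sketch below is illustrative only: it assumes the data.table package is installed and uses a small made-up semicolon-separated file (the column names `id`, `value` and `label` are hypothetical, not taken from data.csv).

```r
# Sketch: compare the column classes produced by read.csv() and fread().
# Assumes data.table is installed; the file contents are made up for illustration.
library(data.table)

path <- tempfile(fileext = ".csv")
writeLines(c("id;value;label", "1;2.5;x", "2;3.5;y"), path)

# read.csv() with every column forced to character (as in test1)
base_classes <- sapply(read.csv(path, sep = ";", colClasses = "character"), class)

# fread() with its defaults (as in test2)
fread_classes <- sapply(fread(path, sep = ";"), class)

print(base_classes)
print(fread_classes)

unlink(path)
```

If fread() reports non-character classes here, it is detecting types by default, so test2 is not doing the same work as test1, which forces every column to character.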

Mau*_*ers 12

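The significant performance advantage of data.table::fread becomes clear if you consider larger files. Here is a fully reproducible example.

  1. Let's generate a CSV file consisting of 10^5 rows and 100 columns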

    if (!file.exists("test.csv")) {
        set.seed(2017)
        df <- as.data.frame(matrix(runif(10^5 * 100), nrow = 10^5))
        write.csv(df, "test.csv", quote = FALSE)
    }
    
  2. We run a microbenchmark analysis (note that this may take a few minutes, depending on your hardware)

    library(microbenchmark)
    res <- microbenchmark(
        read.csv = read.csv("test.csv", header = TRUE, stringsAsFactors = FALSE, colClasses = "numeric"),
        fread = data.table::fread("test.csv", sep = ",", stringsAsFactors = FALSE, colClasses = "numeric"),
        times = 10)
    res
    #          Unit: milliseconds
    #     expr        min         lq       mean     median         uq        max
    # read.csv 17034.2886 17669.8653 19369.1286 18537.7057 20433.4933 23459.4308
    #    fread   287.1108   311.6304   432.8106   356.6992   460.6167   888.6531
    
    
    library(ggplot2)
    autoplot(res)
    

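
To see how the gap depends on file size, one could time both readers on files of increasing size. This is a minimal sketch, not part of the benchmark above: the helper `time_readers()`, the row counts and the 10-column layout are illustrative assumptions, and it uses a single `system.time()` call per reader rather than repeated microbenchmark runs, so the numbers will be noisy.

```r
# Sketch: time read.csv() and data.table::fread() on files of increasing size.
# Assumes data.table is installed; time_readers() is a hypothetical helper.
library(data.table)

time_readers <- function(n_rows, n_cols = 10) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))  # clean up the temporary file when done

  # Generate a purely numeric table and write it without row names
  set.seed(2017)
  df <- as.data.frame(matrix(runif(n_rows * n_cols), nrow = n_rows))
  write.csv(df, path, row.names = FALSE, quote = FALSE)

  # Elapsed wall-clock time in seconds for one read with each function
  c(
    n_rows   = n_rows,
    read.csv = system.time(read.csv(path))[["elapsed"]],
    fread    = system.time(fread(path))[["elapsed"]]
  )
}

timings <- as.data.frame(do.call(rbind, lapply(c(1e2, 1e4, 1e5), time_readers)))
print(timings)
```

On very small files the fixed per-call overhead tends to dominate, which is consistent with the timings in the question. The difference is only expected to show up as the row count grows.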