Reason behind the speed of fread in the data.table package in R

Vij*_*jay 19 performance r fread data.table

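I am amazed by the speed of the fread function in data.table on large data files, but how does it manage to read so fast? What are the basic implementation differences between fread and read.csv?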

Mat*_*wle 29

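I assume we are comparing against read.csv with all the known advice applied, such as setting colClasses, nrows and so on. read.csv(filename) without any other arguments is slow mainly because it first reads everything into memory as if it were character, and then attempts to coerce that to integer or numeric as a second step.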

So, comparing fread to read.csv(filename, colClasses=, nrows=, etc) ...
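For readers who have not seen the tuned form, here is a minimal sketch of what "all the known advice applied" can look like in base R. The sample file, its column names and its column types are made up for illustration.

```r
# Hypothetical sample file; the column names and types are assumptions.
path <- tempfile(fileext = ".csv")
n <- 1000L
sample_data <- data.frame(
  id    = seq_len(n),
  score = seq_len(n) / 10,
  label = rep(c("a", "b"), length.out = n),
  stringsAsFactors = FALSE
)
write.csv(sample_data, path, row.names = FALSE)

# Untuned: read.csv has to work out the column types itself.
untuned <- read.csv(path, stringsAsFactors = FALSE)

# Tuned: column types and the row count are declared up front,
# so no type guessing is needed.
tuned <- read.csv(
  path,
  colClasses = c("integer", "numeric", "character"),
  nrows = n
)

sapply(tuned, class)
```

On a file this small the timing difference is negligible; wrapping each call in system.time() on a large file is the way to see it.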

Both are written in C, so that is not the reason.

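There isn't one reason in particular, but essentially fread memory-maps the file and then iterates through it using pointers, whereas read.csv reads the file into a buffer via a connection.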

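If you run fread with verbose=TRUE, it will tell you how it works and report the time spent in each step. For example, notice that it jumps straight to the middle and the end of the file to make a much better guess of the column types (although in this case the first 5 rows were enough).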

> fread("test.csv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.486 GB
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 6 columns
First row with 6 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 10000001
Subtracted 1 for last eol and any trailing empty lines, leaving 10000000 data rows
Type codes (   first 5 rows): 113431
Type codes (+ middle 5 rows): 113431
Type codes (+   last 5 rows): 113431
Type codes: 113431 (after applying colClasses and integer64)
Type codes: 113431 (after applying drop or select (if supplied)
Allocating 6 column slots (6 - 0 dropped)
Read 10000000 rows and 6 (of 6) columns from 0.486 GB file in 00:00:44
  13.420s ( 31%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   3.210s (  7%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   1.310s (  3%) Allocation of 10000000x6 result (xMB) in RAM
  25.580s ( 59%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.040s (  0%) Changing na.strings to NA
  43.560s        Total
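The first/middle/last sampling reported in the "Type codes" lines can be illustrated with a rough base R sketch. This is not fread's implementation: guess_types_by_sampling is a hypothetical helper, it reads the whole file with readLines instead of memory-mapping it, and it assumes a simple comma-separated file with a header row and no quoted line breaks.

```r
# Hypothetical helper, not part of data.table: guess column types from
# the first, middle and last `k` data rows of a comma-separated file.
guess_types_by_sampling <- function(path, k = 5L) {
  lines <- readLines(path)   # unlike fread, this reads the whole file
  header <- lines[1L]
  body <- lines[-1L]
  n <- length(body)
  stopifnot(n >= 1L)
  mid <- max(1L, n %/% 2L - k %/% 2L)
  idx <- sort(unique(c(
    seq_len(min(k, n)),               # first k rows
    seq(mid, min(n, mid + k - 1L)),   # k rows around the middle
    seq(max(1L, n - k + 1L), n)       # last k rows
  )))
  sampled <- read.csv(text = c(header, body[idx]), stringsAsFactors = FALSE)
  vapply(sampled, function(col) class(col)[1L], character(1))
}

# Made-up file where only the very last row shows that x is not an integer.
path <- tempfile(fileext = ".csv")
values <- c(rep("1", 99), "2.5")
writeLines(c("x,y", paste(values, "a", sep = ",")), path)

# Looking at the first 5 rows alone suggests an integer column.
first_only <- read.csv(path, nrows = 5, stringsAsFactors = FALSE)
class(first_only$x)

# Sampling the start, middle and end picks up the decimal value.
guess_types_by_sampling(path)
```

A wrong early guess is not fatal in fread, as the "type bumps" lines in the verbose output indicate, but a better initial guess avoids that extra work.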

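Note: these timings are from my very slow netbook, which has no SSD. Both the absolute and the relative times of each step will vary widely between machines. For example, if you rerun fread a second time, you may notice that the mmap time is much lower because your OS has cached the file from the previous run.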

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            20
Model:                 2
Stepping:              0
CPU MHz:               800.000         # i.e. my slow netbook
BogoMIPS:              1995.01
Virtualisation:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
NUMA node0 CPU(s):     0,1

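  • For the record, @Sandeep's comment is a bit outdated, since `fread` now supports the `encoding` argument (4 upvotes)
  • Obvious to us != obvious to everyone != correct. I'm not suggesting anything about `fread()`. (2 upvotes)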