Are RDS files "more efficient" than CSV files?

While working with R, I made the following informal observation:

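I noticed that I can import an RDS file faster than a CSV file of similar size.
For example, suppose I have a CSV file on my computer. If I import this CSV file into R, save it as an RDS file with "saveRDS", and then re-import it with "readRDS", importing the RDS version appears to take less time than importing the CSV version of the same file.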

For example:

Step 1: I create a file for my hypothetical scenario

# create file (i.e. imagine this file currently exists on the computer in CSV format)

# 20 integer columns (col1 ... col20), each with 1,000,000 values sampled from 1 to 100
test_file <- as.data.frame(replicate(20, sample.int(100, 1000000, replace = TRUE)))
names(test_file) <- paste0("col", 1:20)


Step 2: Compare export times (RDS is faster)

start.time <- Sys.time()
write.csv(test_file, "test_file.csv")
end.time <- Sys.time()
end.time - start.time

#Time difference of 28.84087 secs

start.time <- Sys.time()
saveRDS(test_file, "test_file.RDS")
end.time <- Sys.time()
end.time - start.time

#Time difference of 11.96845 secs
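
A single Sys.time() difference can be noisy, so the same comparison could also be sketched with system.time(). This is only a minimal sketch under assumptions: the data frame `small` and the temporary file paths are hypothetical stand-ins, and the data is kept much smaller than above so that it runs quickly.

```r
# Hypothetical smaller data set so the sketch runs quickly
set.seed(1)
small <- as.data.frame(replicate(5, sample.int(100, 10000, replace = TRUE)))

# Temporary files (assumed names, not the files used in the question)
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# system.time() reports user, system and elapsed time for one expression
csv_write <- system.time(write.csv(small, csv_path, row.names = FALSE))
rds_write <- system.time(saveRDS(small, rds_path))

# Keep only the wall-clock (elapsed) seconds for each format
elapsed <- c(csv = csv_write[["elapsed"]], rds = rds_write[["elapsed"]])
print(elapsed)
```

Here row.names = FALSE is passed to write.csv so that the CSV does not carry an extra row-name column that the RDS file lacks. Repeating the timings several times and comparing medians would probably give a fairer picture than a single run.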


Step 3: Compare the sizes of the two files: the RDS file is smaller (by a factor of about 2.6)

# file.info()$size is reported in bytes

> file.info("test_file.csv")$size
[1] 68287349


> file.info("test_file.RDS")$size
[1] 26169028

> 68287349/26169028
[1] 2.609472
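
Part of the size difference is probably down to compression: saveRDS() compresses its output with gzip by default (compress = TRUE), while write.csv() writes plain text. A minimal sketch that separates the two effects, assuming a small hypothetical data frame `small` and temporary file paths:

```r
# Hypothetical smaller data set so the sketch runs quickly
set.seed(1)
small <- as.data.frame(replicate(5, sample.int(100, 10000, replace = TRUE)))

csv_path <- tempfile(fileext = ".csv")
rds_gz   <- tempfile(fileext = ".rds")
rds_raw  <- tempfile(fileext = ".rds")

write.csv(small, csv_path, row.names = FALSE)
saveRDS(small, rds_gz)                     # default: compress = TRUE (gzip)
saveRDS(small, rds_raw, compress = FALSE)  # same object, no compression

# file.size() returns sizes in bytes
sizes <- file.size(csv_path, rds_gz, rds_raw)
names(sizes) <- c("csv", "rds_gzip", "rds_uncompressed")
print(sizes)
```

Comparing the uncompressed RDS size with the CSV size shows how much of the saving comes from the format itself and how much from gzip.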


Step 4: Compare import times (RDS is faster)

start.time <- Sys.time()
test = read.csv("test_file.csv")
end.time <- Sys.time()
end.time - start.time
#Time difference of 8.349364 secs


start.time <- Sys.time()
test = readRDS("test_file.RDS")
end.time <- Sys.time()
end.time - start.time
#Time difference of 0.59674 secs


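Based on these measurements, RDS looks like the clear winner: compared with CSV, it takes less time to export and import, and it takes up less space. My naive explanation is that RDS is R's "native file type", so RDS files may be designed in a way that lets R read and write them faster than CSV files, but I am not sure about this.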
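
One way to probe the "native file type" idea: an RDS file is a serialized R object, so column types survive the round trip, whereas a CSV file is plain text that read.csv has to parse and re-infer types from. A minimal sketch, assuming a small hypothetical data frame `df` with a factor and a Date column:

```r
# Hypothetical data frame with types that plain text cannot fully describe
df <- data.frame(
  id    = 1:3,
  group = factor(c("a", "b", "a"), levels = c("b", "a")),
  day   = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)

from_csv <- read.csv(csv_path)  # types are guessed again from the text
from_rds <- readRDS(rds_path)   # the stored object is restored as saved

# Compare the first class of every column after each round trip
print(sapply(from_csv, function(x) class(x)[1]))
print(sapply(from_rds, function(x) class(x)[1]))
print(identical(df, from_rds))
```

The extra parsing and type guessing that the CSV route needs is a plausible contributor to the slower import measured above, although this sketch only demonstrates the difference in types, not in speed.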

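I would like to know whether there is any factual basis for this claim (for example, the results might differ if I repeated the experiment, or tried it on larger or smaller files with different data types), or whether I carried out the experiment incorrectly.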
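
To check whether the result holds for other sizes and data types, the experiment could be wrapped in a function and repeated. This is only a sketch under assumptions: `time_round_trip` and `make_data` are hypothetical helpers, and the sizes are kept small so that it runs in a few seconds.

```r
# Hypothetical helper: time one CSV and one RDS round trip for a data frame
time_round_trip <- function(df) {
  csv_path <- tempfile(fileext = ".csv")
  rds_path <- tempfile(fileext = ".rds")
  on.exit(unlink(c(csv_path, rds_path)))  # remove the temp files afterwards

  csv_write <- system.time(write.csv(df, csv_path, row.names = FALSE))[["elapsed"]]
  rds_write <- system.time(saveRDS(df, rds_path))[["elapsed"]]
  csv_read  <- system.time(read.csv(csv_path))[["elapsed"]]
  rds_read  <- system.time(readRDS(rds_path))[["elapsed"]]

  c(csv_write = csv_write, rds_write = rds_write,
    csv_read = csv_read, rds_read = rds_read)
}

# Hypothetical generator mixing integer, numeric and character columns
make_data <- function(n) {
  data.frame(
    int_col = sample.int(100, n, replace = TRUE),
    num_col = rnorm(n),
    chr_col = sample(letters, n, replace = TRUE),
    stringsAsFactors = FALSE
  )
}

set.seed(42)
sizes <- c(1000L, 10000L, 100000L)

# One row of elapsed times (in seconds) per data size
results <- t(sapply(sizes, function(n) time_round_trip(make_data(n))))
rownames(results) <- sizes
print(results)
```

Each row gives elapsed seconds for one data size. Timings this short are noisy, so several repetitions per size would be needed before drawing conclusions.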

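PS: In the future, I plan to run machine learning and statistical models on large data sets. With that in mind, I would like to know whether converting the data sets to RDS would be advantageous and save time and resources.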
