Reading a huge CSV into R

use*_*622 17 windows csv ram r

I have a huge CSV file. It is about 9 GB in size, and I have 16 GB of RAM. I followed the advice on the page and implemented the suggestion below.

If you get the error that R cannot allocate a vector of length x, close out of R and add the following line to the "Target" field: 
--max-vsize=500M 

I still get the error and warnings shown below. How should I read this 9 GB file into R? I have 64-bit R 3.3.1 and I run the commands in RStudio 0.99.903, on Windows Server 2012 R2 Standard, a 64-bit OS.

> memory.limit()
[1] 16383
> answer=read.csv("C:/Users/a-vs/results_20160291.csv")
Error: cannot allocate vector of size 500.0 Mb
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
2: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
  ... (warnings 3 through 12 are identical) ...

------------------- Update 1

My first attempt, based on the suggested answer:

> thefile=fread("C:/Users/a-vs/results_20160291.csv", header = T)
Read 44099243 rows and 36 (of 36) columns from 9.399 GB file in 00:13:34
Warning messages:
1: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv",  :
  Reached total allocation of 16383Mb: see help(memory.size)
2: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv",  :
  Reached total allocation of 16383Mb: see help(memory.size)

------------------- Update 2

My second attempt, based on the suggested answer, is below:

thefile2 <- read.csv.ffdf(file="C:/Users/a-vs/results_20160291.csv", header=TRUE, VERBOSE=TRUE, 
+                    first.rows=-1, next.rows=50000, colClasses=NA)
read.table.ffdf 1..
Error: cannot allocate vector of size 125.0 Mb
In addition: There were 14 warnings (use warnings() to see them)

How can I read this file into a single object so that I can analyze the entire dataset in one pass?

------------------- Update 3

We bought an expensive machine. It has 10 cores and 256 GB of RAM. It's not the most efficient solution, but it will at least work for the near future. I looked at the answers below and I don't think they solve my problem :( I do appreciate them, though. I want to perform a market basket analysis, and I don't think there is any way around keeping my data in RAM.

Hac*_*k-R 17

Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
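
A quick way to verify this from the console (a minimal sketch; memory.limit() is Windows-only):

# TRUE on a 64-bit build of R (8-byte pointers)
.Machine$sizeof.pointer == 8
R.version$arch # e.g. "x86_64"

# On Windows, memory.limit() reports the cap in MB; it can only be raised
# (e.g. memory.limit(size = 16000)), and it is already maxed out in the question
memory.limit()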

In addition, you can read the file in chunks:

file_in    <- file("in.csv","r")
chunk_size <- 100000 # choose the best size for you
x          <- readLines(file_in, n=chunk_size)
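
To process the whole file this way, wrap the read in a loop so that only one chunk is held in memory at a time (a sketch; the per-chunk processing is a placeholder for your own logic):

file_in    <- file("in.csv", "r")
chunk_size <- 100000
repeat {
  x <- readLines(file_in, n = chunk_size)
  if (length(x) == 0) break # end of file reached
  # parse/aggregate the chunk here, keeping only the summaries you need
}
close(file_in)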

You can read and manipulate large files more efficiently with data.table:

require(data.table)
fread("in.csv", header = T)
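
If you only need a subset of the 36 columns, fread's select argument (or its counterpart drop) reads just those columns and cuts memory use proportionally; the column names here are placeholders:

require(data.table)
thefile <- fread("in.csv", header = T, select = c("COL2", "COL3"))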

If needed, you can fall back on disk-based storage with ff:

library("ff")
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, 
                   first.rows=10000, next.rows=50000, colClasses=NA)
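
The resulting ffdf object keeps its columns on disk, pulling rows into RAM only when you index them. A small sketch of what working with it looks like:

dim(x)   # dimensions, without loading the data into RAM
x[1:5, ] # materialise a small slice as an ordinary data.frame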

  • @user2543622 use `ff`. But just for the record, reading files in chunks is standard practice for big data. The other answer is that you can pre-process the data in SQL first. Perhaps once you have it in R you could also move some of it into a sparse matrix (see the sketch below). (2 upvotes)
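
For the sparse-matrix idea in that comment, a minimal sketch with the Matrix package (the basket/item encoding is hypothetical):

library(Matrix)
# rows = baskets, columns = items; only the non-zero entries are stored
m <- sparseMatrix(i = c(1, 1, 2, 3), # basket ids
                  j = c(1, 3, 2, 3), # item ids
                  x = 1)
object.size(m) # far smaller than the equivalent dense matrix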

Jon*_*oll 9

You might want to consider leveraging some on-disk processing rather than holding the whole object in R's memory. One option is to store the data in a proper database and then have R access it. dplyr can handle remote data sources (it actually writes the SQL statements to query the database). I've only tested this with a small example (a mere 17,500 rows), but hopefully it scales to your requirements.

Install SQLite

https://www.sqlite.org/download.html

Get the data into a new SQLite database

  • Save the following in a new file named import.sql

CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName

Yes, you'll need to specify the column names yourself (I believe), but you can also specify their types here if you wish. This won't work if you have commas anywhere in your names/data, of course.

  • Import the data into the SQLite database via the command line

sqlite3.exe BIGDATA.sqlite3 < import.sql

Point dplyr to the SQLite database

As we're using SQLite, all of the dependencies are already handled by dplyr.

library(dplyr)
my_db <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")

Do your exploratory analysis

dplyr will write the SQLite commands needed to query this data source; otherwise it behaves mostly like a local table. The big exception is that you can't query the number of rows.

my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3))

#>  Source:   query [?? x 2]
#>  Database: sqlite 3.8.6 [/PATH/TO/YOUR/DB/BIGDATA.sqlite3]
#>  
#>         COL2    meanVal
#>        <chr>      <dbl>
#>  1      1979   15.26476
#>  2      1980   16.09677
#>  3      1981   15.83936
#>  4      1982   14.47380
#>  5      1983   15.36479
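
If an aggregate like the one above is small enough to fit in memory, dplyr's collect() will execute the query and return the result as a local data frame (same placeholder columns as above):

local_result <- my_tbl %>%
  group_by(COL2) %>%
  summarise(meanVal = mean(COL3)) %>%
  collect() # runs the query in SQLite and returns an ordinary tbl_df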


Chr*_*ris 5

This may not be possible on your machine. In certain cases, a data.table takes up more space than its .csv counterpart.

library(data.table)
DT <- data.table(x = sample(1:2, 10000000, replace = T))
write.csv(DT, "test.csv") # 29 MB file
DT <- fread("test.csv", drop = 1) # fread has no row.names argument; drop the row-name column instead
object.size(DT)
> 40001072 bytes # 40 MB

Two orders of magnitude larger:

DT <- data.table(x = sample(1:2, 1000000000, replace = T))
write.csv(DT, "test.csv") # 2.92 GB file
DT <- fread("test.csv", drop = 1) # again dropping the row-name column written by write.csv
object.size(DT)
> 4000001072 bytes # 4.00 GB

There is natural overhead to storing objects in R. Based on these numbers, reading a file carries a factor of roughly 1.33; however, this varies with the data. For example, using

  • x = sample(1:10000000, 10000000, replace = T) gives a factor of roughly 2x (R : csv).

  • x = sample(c("foofoofoo","barbarbar"), 10000000, replace = T) gives a factor of 0.5x (R : csv).

Based on the maximum factor, your 9 GB file would take up to 18 GB of memory to store in R, if not more. Judging from your error messages, it is far more likely that you are hitting hard memory limits rather than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work; you would also need to partition your analysis and your workflow. Another alternative is to use an out-of-memory tool like SQL.
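
If you want to sanity-check such an estimate against your own data before committing to a full read, one approach is to load a slice and extrapolate (a rough sketch; the path and the 44,099,243-row count are taken from the question above):

library(data.table)
sample_dt <- fread("C:/Users/a-vs/results_20160291.csv", nrows = 100000)
bytes_per_row <- as.numeric(object.size(sample_dt)) / nrow(sample_dt)
bytes_per_row * 44099243 / 1024^3 # estimated in-memory size in GB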