Reading in chunks at a time using fread in package data.table

Question

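I'm trying to read a large tab-delimited file (around 2 GB) using the fread function from the data.table package. However, because the file is so large, it doesn't fit completely in memory. I tried to read it in chunks by using the skip and nrow arguments, for example: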

chunk.size = 1e6
done = FALSE
chunk = 1
while(!done)
{
    temp = fread("myfile.txt",skip=(chunk-1)*chunk.size,nrow=chunk.size-1)
    #do something to temp
    chunk = chunk + 1
    if(nrow(temp)<2) done = TRUE
}

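In the example above, I read 1 million rows at a time, perform a calculation on them, then get the next million, and so on. The problem with this code is that after each chunk is retrieved, fread has to start scanning the file from the very beginning again, because skip increases by a million after every loop iteration. As a result, fread takes longer and longer to actually reach the next chunk, which makes this very inefficient.

Is there a way to tell fread to pause every, say, 1 million lines, and then continue reading from that point without having to restart at the beginning? Are there any solutions, or should this be a new feature request?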

Answer 1

use*_*672 15

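You should use the LaF package. It introduces a sort of pointer on your data, which avoids the annoying (for very large data) behaviour of reading the whole file. As far as I know, fread() in the data.table package needs to know the total number of rows, which takes time for GB-sized data. With the pointer in LaF you can go to whichever line you want, read in a chunk of data, apply your function to it, and then move on to the next chunk. On my small PC I ran through a 25 GB csv file in steps of 10e6 lines and extracted all of the ~5e6 observations I needed; each 10e6 chunk took 30 seconds.

Update: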

library('LaF')
huge_file <- 'C:/datasets/protein.links.v9.1.txt'

#First detect a data model for your file:
model <- detect_dm_csv(huge_file, sep=" ", header=TRUE)

Then create a connection to your file using the model:

df.laf <- laf_open(model)

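Once that is done, you can do all sorts of things without needing to know the size of the file, as the data.table package does. For example, place the pointer at line 100e6 and read 1e6 lines of data from there: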

goto(df.laf, 100e6)
data <- next_block(df.laf,nrows=1e6)

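Now data contains 1e6 lines of your CSV file (starting from line 100e6).

You can read the data in chunks (sized according to your memory) and keep only what you need. For example, huge_file in my example points to a file with all known protein sequences and is larger than 27 GB, which is far too big for my PC. To get only the human sequences, I filtered on the organism id, which is 9606 for human and should appear at the start of the variable protein1. A quick and dirty way is to put this into a simple for loop and read one chunk of data at a time: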

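library('dplyr')
library('stringr')

# Start from an empty data frame with the same columns as the file
res <- df.laf[1,][0,]
for(i in 1:10){
  # Read the next block and keep only the human (9606) rows
  raw <-
    next_block(df.laf,nrows=100e6) %>%
    filter(str_detect(protein1,"^9606\\."))
  res <- rbind(res, raw)
}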

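Now res contains the filtered human data. Better still, for more complex operations such as computing on the data on the fly, the function process_blocks() takes a function as an argument. Inside that function you can do whatever you want with each piece of data. Read the documentation.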
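As a rough illustration of the process_blocks() route, here is a minimal sketch rather than the code used on the protein file. It assumes the callback convention described in the LaF documentation: the callback receives the current block and the value returned by the previous call, gets NULL as that value on the first call, and is called one final time with a zero-row block so it can return its result. The demo file, its id and value columns, and the helper name sum_values are all made up for the example.

```r
library(LaF)

# Hypothetical demo file: 25 data rows with two integer columns.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, value = (1:25) * 2), path, row.names = FALSE)

# Detect the data model and open the file, as in the steps above.
model <- detect_dm_csv(path, sep = ",", header = TRUE)
laf <- laf_open(model)

# Callback (hypothetical name): keeps a running total of the value column.
sum_values <- function(block, prev) {
  if (is.null(prev)) prev <- 0        # first call: no running total yet
  if (nrow(block) == 0) return(prev)  # final call: hand back the result
  prev + sum(block$value)
}

# Walk through the file 10 rows at a time; only one block is in memory at once.
total <- process_blocks(laf, sum_values, nrows = 10)
print(total)
```

The same pattern works for filtering: let the callback rbind the rows it wants to keep onto prev instead of adding to a number.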

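Thank you. I have a 61 GB file with 872493862 rows, and this ran reasonably fast. I tried the same looping approach with fread() using "nrows" and "skip", but it got slower and slower with each loop because it had to skip more rows. (2 upvotes)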

Answer 2

Ren*_*rop 8

You can use readr's read_*_chunked functions to read in the data and, for example, filter it chunk by chunk:

library(readr)

# Cars with 3 gears
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f), chunk_size = 5)

Answer 3

小智 7

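fread() can definitely help you read the data in chunks.

The mistake in your code is that nrow should stay constant while you change the value of the skip argument inside the loop.

This is what I wrote for my data: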

library(data.table)

data <- list()

for (i in 0:20){
    # nrow stays constant; only skip advances with each iteration
    data[[i+1]] <- fread("my_data.csv",nrow=10000,select=c(1,2:100),skip=10000*i)
}

You can insert the following code into your loop:

start_time <- Sys.time()
#####something!!!!

end_time <- Sys.time()

end_time - start_time

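to check the timing: each loop takes a similar amount of time on average.

Then you can use another loop to combine the data by rows with R's default rbind function.

The sample code could look something like this: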

new_data = data[[1]]

for (i in 1:20){
    new_data=rbind(new_data,data[[i+1]],use.names=FALSE)
}

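to combine everything into one large dataset.

I hope my answer helps with your question.

I used this method to load 18 GB of data with 2k+ columns and 200k rows in about 8 minutes.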
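One pitfall with the skip/nrow loop is the header line: every chunk after the first starts in the middle of the file, so fread may not see the column names. The sketch below is one way to handle that, under some assumptions: the demo file and its id and value columns are made up, the column names are read once with nrows = 0, and each chunk is then read with header = FALSE. It also stops at the end of the file instead of using a fixed number of iterations, and combines the pieces with data.table's rbindlist() rather than a second rbind loop. It does not remove the rescanning cost described in the question.

```r
library(data.table)

# Hypothetical demo file: a header line plus 25 tab-separated data rows.
path <- tempfile(fileext = ".tsv")
fwrite(data.table(id = 1:25, value = (1:25) * 2), path, sep = "\t")

chunk_size <- 10L
col_names <- names(fread(path, nrows = 0))  # read the column names only

chunks <- list()
i <- 0L
repeat {
  chunk <- tryCatch(
    fread(path,
          skip = 1L + i * chunk_size,  # the extra 1 skips the header line
          nrows = chunk_size,          # constant chunk size
          header = FALSE,
          col.names = col_names),
    error = function(e) NULL           # e.g. skip points past the end of the file
  )
  if (is.null(chunk) || nrow(chunk) == 0L) break
  chunks[[length(chunks) + 1L]] <- chunk
  if (nrow(chunk) < chunk_size) break  # a short chunk means this was the last one
  i <- i + 1L
}

# Bind all chunks in one step.
combined <- rbindlist(chunks)
print(dim(combined))
```

If the per-chunk work is a filter or an aggregate, apply it inside the loop before storing the chunk, so that only the reduced result is kept in memory.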

Answer 4

Ben*_*Ben 5

A related option is the chunked package. Here is an example with a 3.5 GB text file:

library(chunked)
library(tidyverse)

# I want to look at the daily page views of Wikipedia articles
# before 2015... I can get zipped log files
# from here: https://dumps.wikimedia.org/other/pagecounts-ez/merged/2012/2012-12/
# I get bz file, unzip to get this: 

my_file <- 'pagecounts-2012-12-14/pagecounts-2012-12-14'

# How big is my file?
print(paste(round(file.info(my_file)$size  / 2^30,3), 'gigabytes'))
# [1] "3.493 gigabytes" too big to open in Notepad++ !
# But can read with 010 Editor

# look at the top of the file 
readLines(my_file, n = 100)

# to find where the content starts, vary the skip value, 
read.table(my_file, nrows = 10, skip = 25)

This is where we start working on chunks of the file; we can use most dplyr verbs in the usual way:

# Let the chunked pkg work its magic! We only want the lines containing 
# "Gun_control". The main challenge here was identifying the column
# header
df <- 
read_chunkwise(my_file, 
               chunk_size=5000,
               skip = 30,
               format = "table",
               header = TRUE) %>% 
  filter(stringr::str_detect(De.mw.De.5.J3M1O1, "Gun_control"))

# this line does the evaluation, 
# and takes a few moments...
system.time(out <- collect(df))

Here we can work with the output as usual, since it is much smaller than the input file:

# clean up the output to separate into cols, 
# and get the number of page views as a numeric
out_df <- 
out %>% 
  separate(De.mw.De.5.J3M1O1, 
           into = str_glue("V{1:4}"),
           sep = " ") %>% 
  mutate(V3 = as.numeric(V3))

 head(out_df)
    V1                                                        V2   V3
1 en.z                                               Gun_control 7961
2 en.z Category:Gun_control_advocacy_groups_in_the_United_States 1396
3 en.z          Gun_control_policy_of_the_Clinton_Administration  223
4 en.z                            Category:Gun_control_advocates   80
5 en.z                         Gun_control_in_the_United_Kingdom   68
6 en.z                                    Gun_control_in_america   59
                                                                                 V4
1 A34B55C32D38E32F32G32H20I22J9K12L10M9N15O34P38Q37R83S197T1207U1643V1523W1528X1319
2                                     B1C5D2E1F3H3J1O1P3Q9R9S23T197U327V245W271X295
3                                     A3B2C4D2E3F3G1J3K1L1O3P2Q2R4S2T24U39V41W43X40
4                                                            D2H1M1S4T8U22V10W18X14
5                                                             B1C1S1T11U12V13W16X13
6                                                         B1H1M1N2P1S1T6U5V17W12X12

#--------------------
