data.table::fread is very well suited to this. Let's write an example file:
library(data.table)
set.seed(39439)
NN = 3e8
DT = data.table(
ID1 = sample(LETTERS, NN, TRUE),
ID2 = sample(letters, NN, TRUE),
V1 = rnorm(NN)
)
DT
# ID1 ID2 V1
# 1: O h 0.1580064
# 2: K l -2.4281532
# 3: F z 1.7353759
# 4: B f -1.0911407
# 5: M w 0.7187998
# ---
# 299999996: D u -0.8221716
# 299999997: F f -2.4881300
# 299999998: W t 0.0371132
# 299999999: I h -1.2020380
# 300000000: L s -2.2284455
# smaller than your data, but still large
format(object.size(DT), 'Gb')
# [1] "6.7 Gb"
# write to test file
fwrite(DT, tmp <- tempfile())
# size on disk about the same
file.info(tmp)$size/1024^3
# [1] 6.191435
Two options: (1) read the whole file into R, then filter:
rm(DT)
system.time({
DT = fread(tmp)
DT = DT[ID2 == 'a']
})
# user system elapsed
# 50.390 25.662 40.004
About 40 seconds elapsed.
(2) Filter with awk, then read:
rm(DT)
system.time({
# note: the header line is filtered out too, so columns come back as V1, V2, V3
DT = fread(cmd = paste('awk -F, \'$2 == "a"\'', tmp))
})
# user system elapsed
# 350.170 3.775 354.638
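The latter is much slower (about 355 seconds versus 40) because fread parses the file in parallel, while the awk filter runs on a single thread. What awk buys you is memory efficiency: the first approach takes up memory for the entire file before filtering down to the smaller table, whereas the awk approach only ever loads the filtered rows into memory.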
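One side effect of filtering before reading is that the header line is filtered out along with the non-matching rows. The following is a minimal sketch of one way to keep it, assuming a POSIX shell with awk on the PATH; the toy file and the names toy, awk_cmd, filtered and no_header are made up for illustration and are not part of the benchmark above.

```r
library(data.table)

# Small stand-in for the large file (hypothetical toy data)
toy <- tempfile(fileext = ".csv")
fwrite(data.table(
  ID1 = c("A", "B", "C", "D"),
  ID2 = c("a", "b", "a", "c"),
  V1  = c(1.5, 2.5, 3.5, 4.5)
), toy)

# NR == 1 keeps the header line; $2 == "a" keeps the matching rows
awk_cmd <- paste("awk -F, 'NR == 1 || $2 == \"a\"'", shQuote(toy))
filtered <- fread(cmd = awk_cmd)

# Without NR == 1 the header is dropped, so fread assigns default names
no_header <- fread(cmd = paste("awk -F, '$2 == \"a\"'", shQuote(toy)))

print(filtered)
print(names(no_header))
```

The extra NR == 1 test is cheap compared with the field comparison, so it should not change the timing picture much, but that has not been measured here.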
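(2*) In this case you can actually use grep as well, but note that this only works because in this file the pattern ",a," can only match in one column (ID2):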
rm(DT)
system.time({
DT = fread(cmd = paste('grep -F ",a,"', tmp))
})
# user system elapsed
# 164.587 2.500 167.165
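To illustrate why the grep shortcut depends on the file layout, here is a small sketch on a hypothetical four-column file in which two middle columns can both contain "a". It assumes grep and awk are available in a POSIX shell; the file and the names toy, cols, by_grep and by_awk are invented for the example.

```r
library(data.table)

# Hypothetical toy file: both ID2 and ID3 can hold the value "a"
toy <- tempfile(fileext = ".csv")
fwrite(data.table(
  ID1 = c("A", "B", "C"),
  ID2 = c("a", "b", "c"),
  ID3 = c("x", "a", "z"),
  V1  = c(1, 2, 3)
), toy)

cols <- c("ID1", "ID2", "ID3", "V1")

# grep matches ",a," anywhere on the line, so the row with ID3 == "a" slips through
by_grep <- fread(cmd = paste("grep -F ',a,'", shQuote(toy)),
                 header = FALSE, col.names = cols)

# awk compares the second field only
by_awk <- fread(cmd = paste("awk -F, '$2 == \"a\"'", shQuote(toy)),
                header = FALSE, col.names = cols)

print(by_grep)
print(by_awk)
```

Here grep returns two rows while awk returns only the row whose second field is "a", so grep is only a safe substitute when the pattern cannot occur in any other position.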
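P.S. Beware the "sticker price" of vroom. As mentioned, it only indexes your data at first, so comparing the time it takes to merely read the data can be misleading. You have to time how long it takes to actually do something with the data, since that is what triggers the data to be loaded. Here is a comparison: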
# to offset some re-reading optimizations in fread
file.copy(tmp, tmp <- tempfile())
rm(DT)
system.time({
DT = fread(tmp)
DT = DT[ID2 == 'a']
DT[ , .(mean(V1)), by = .(ID1, ID2)]
})
# user system elapsed
# 61.930 31.740 52.958
library(dplyr)
rm(DT)
system.time({
DT = vroom::vroom(tmp)
DT = DT %>% filter(ID2 == 'a')
DT %>% group_by(ID1, ID2) %>% summarize(mean(V1))
})
# user system elapsed
# 122.605 56.562 129.957
(The comparison comes out roughly the same if the third step, the grouped mean, is skipped.)
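The point about timing the read together with the first real use of the data can be turned into a small helper. This is only a sketch: time_both is a hypothetical function written for this example, the toy file is far too small to reproduce the numbers above, and the vroom branch runs only if that package happens to be installed.

```r
library(data.table)

# Hypothetical toy file; much smaller than the benchmark file
toy <- tempfile(fileext = ".csv")
set.seed(1)
n <- 1e5
fwrite(data.table(ID2 = sample(letters, n, TRUE), V1 = rnorm(n)), toy)

# Time the read alone, then the read plus a step that must touch the values
time_both <- function(read_fun) {
  read_only <- system.time(x <- read_fun(toy))[["elapsed"]]
  read_and_use <- system.time({
    x <- read_fun(toy)
    m <- mean(x$V1[x$ID2 == "a"])
  })[["elapsed"]]
  c(read_only = read_only, read_and_use = read_and_use)
}

timings <- list(fread = time_both(fread))
if (requireNamespace("vroom", quietly = TRUE)) {
  timings$vroom <- time_both(function(f) vroom::vroom(f))
}
print(timings)
```

For a reader that defers loading, the gap between read_only and read_and_use is the part that a read-only benchmark leaves out.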