Vas*_*y A 4 performance r line-endings data.table
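I have tab-delimited tables in text files in which every line ends with \r\r\n (0x0D 0x0D 0x0A). If I try to read such a file with fread(), it says
Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.
But I did not download these files; I already have them.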
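For anyone unsure whether a file really has this byte pattern, a minimal base R check could look like the sketch below. The function name has_crcrlf and the 4096-byte window are assumptions made for illustration; it only inspects the first line ending it finds.

```r
# Hypothetical helper (not part of data.table): reports whether the first
# line feed in the file is preceded by two carriage returns (0x0D 0x0D 0x0A).
has_crcrlf <- function(path, n = 4096L) {
  # Read the first n bytes as raw so no newline translation takes place
  bytes <- readBin(path, what = "raw", n = n)
  lf <- which(bytes == as.raw(0x0a))
  # No line feed in the window, or not enough bytes before it
  if (length(lf) == 0L || lf[1] < 3L) return(FALSE)
  i <- lf[1]
  bytes[i - 1L] == as.raw(0x0d) && bytes[i - 2L] == as.raw(0x0d)
}
```

Calling has_crcrlf(myfilename) on one of the affected files should return TRUE if the description above holds.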
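So far, the only solution I have found is to read the file with read.table() first (it treats the \r\r\n combination as a single end-of-line marker) and then convert the resulting data.frame to a data.table: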
mydt <- data.table(read.table(myfilename, header = TRUE, sep = '\t', fill = TRUE))
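But I would like to know whether there is any way to avoid the slow read.table() and use the fast fread() instead.
I suggest using the GNU utility tr to get rid of those unnecessary \r characters. For example: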
library(data.table)
cat("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6", file = "test.csv")
fread("test.csv")
## Error in fread("test.csv") :
## Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.
system("tr -d '\r' < test.csv > test2.csv")
fread("test2.csv")
##    a b c
## 1: 1 2 3
## 2: 4 5 6
If you are on Windows and do not have the tr utility, you can get it here.
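If installing tr is not an option, the same byte-level cleanup can be sketched in base R. The helper name strip_cr is hypothetical and not part of data.table. The sketch assumes the file fits in memory and that no field contains a legitimate carriage return, because it deletes every 0x0D byte, just as tr -d '\r' does.

```r
# Hypothetical helper: copy a file while dropping every carriage return byte.
# Assumes the whole file fits in memory and that no \r is meaningful data.
strip_cr <- function(infile, outfile = tempfile(fileext = ".txt")) {
  # Read the file as raw bytes so nothing is translated on the way in
  bytes <- readBin(infile, what = "raw", n = file.size(infile))
  # Keep everything except 0x0D and write the result back in binary mode
  writeBin(bytes[bytes != as.raw(0x0d)], outfile)
  outfile
}

# Usage sketch (assumes data.table is installed; "myfile.txt" is a placeholder):
# library(data.table)
# mydt <- fread(strip_cr("myfile.txt"), sep = "\t")
```

This writes a cleaned temporary copy rather than modifying the original, so the source file stays untouched. It was not part of the timing comparison in this answer.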
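Addition:
I ran some comparisons of the three methods using a 100,000 x 5 sample csv dataset.
OPcsv is the "slow" read.table approach, freadScan discards the extra \r characters in pure R, and freadtr calls GNU tr through the shell directly from fread(). The third approach is the fastest: in the relative timings below, the other two take roughly 1.4 and 1.7 times as long at the median.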
# create a 100,000 x 5 sample dataset with lines ending in \r\r\n
delim <- "\r\r\n"
sample.txt <- paste0("a, b, c, d, e", delim)
for (i in 1:100000) {
  sample.txt <- paste0(sample.txt,
                       paste(round(runif(5)*100), collapse = ","),
                       delim)
}
cat(sample.txt, file = "sample.csv")
# function that strips the extra \r characters using R only
fread2 <- function(filename) {
  tmp <- scan(file = filename, what = "character", quiet = TRUE)
  # remove empty lines caused by \r
  tmp <- tmp[tmp != ""]
  # paste lines back together with the \n character
  tmp <- paste(tmp, collapse = "\n")
  fread(tmp)
}
# OP's approach using read.table, which is slow
readcsvMethod <- function(myfilename)
  data.table(read.table(myfilename, header = TRUE, sep = ',', fill = TRUE))
library(microbenchmark)
microbenchmark(OPcsv = readcsvMethod("sample.csv"),
               freadScan = fread2("sample.csv"),
               freadtr = fread("tr -d '\\r' < sample.csv"),
               unit = "relative")
## Unit: relative
##       expr      min       lq     mean   median       uq      max neval
##      OPcsv 1.331462 1.336524 1.340037 1.365397 1.366041 1.249223   100
##  freadScan 1.532169 1.581195 1.624354 1.673691 1.676596 1.355434   100
##    freadtr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
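The sample file above is built by growing one string inside a loop, which gets slow as the row count rises. A vectorized sketch for producing the same kind of test file is shown here. The function name make_sample and its arguments are made up for illustration, and the header is written without spaces after the commas, unlike the original. Writing with writeBin() is meant to ensure the \r\r\n bytes reach the disk unchanged on any platform.

```r
# Hypothetical generator for a csv test file whose lines end in \r\r\n.
make_sample <- function(path, n_rows = 100000L, n_cols = 5L, eol = "\r\r\n") {
  # Random integers between 0 and 100, one row per line
  vals <- matrix(round(runif(n_rows * n_cols) * 100), ncol = n_cols)
  # Paste the columns together row-wise in a single vectorized call
  rows <- do.call(paste, c(as.data.frame(vals), sep = ","))
  header <- paste(letters[seq_len(n_cols)], collapse = ",")
  # Append the line ending to every line and join into a single string
  txt <- paste0(c(header, rows), eol, collapse = "")
  # Binary write, so the line-ending bytes are not translated
  writeBin(charToRaw(txt), path)
  invisible(path)
}

# Usage sketch: make_sample("sample.csv")
```

The timings reported above were produced with the loop-built file, not with this generator.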