Eri*_*ang 12 r data.table
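I have a 5 GB .dat file (more than 10 million lines). Each line has a format like aaaa bb cccc0123 xxx kkkkkkkkkkkkkk or aaaaabbbcccc01234xxxkkkkkkkkkkkkkk, for example. Because readLines() performs poorly on large files, I chose fread() to read it, but it raised an error: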
library("data.table")
x <- fread("test.DAT")
Error in fread("test.DAT") :
Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.DAT") :
Unable to find 5 lines with expected number of columns (+ middle)
How can I use fread() like readLines(), without automatic column detection? Or is there another way to solve this problem?
Ric*_*ven 22
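Here's a trick. You can use a sep value that you know does not occur in the file. Doing so forces fread() to read each whole line as a single column. We can then drop that column down to an atomic vector (done with [[1L]] below). Here is an example on a csv file where I use ? as the sep. Used this way, fread() behaves much like readLines(), only a lot faster.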
f <- fread("Batting.csv", sep= "?", header = FALSE)[[1L]]
head(f)
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
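Other uncommon characters you can try for sep include \ ^ @ # = and others. We can see that this produces the same output as readLines(). It is just a matter of finding a sep value that is not present in the file.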
head(readLines("Batting.csv"))
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
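One way to look for a sep value that is absent from the file is to test a few candidates against a sample of lines. The following is only a rough sketch in base R: unused_seps() is a hypothetical helper name, the candidate list and the sample size n are arbitrary choices, and because only the first n lines are inspected, a character that passes the check could still appear further down a large file.

```r
# Hypothetical helper: returns the candidate separators that never occur
# in the first `n` lines of `path`. Only a sample is inspected, so a
# character absent from the sample may still appear later in the file.
unused_seps <- function(path,
                        candidates = c("?", "\\", "^", "@", "#", "="),
                        n = 10000L) {
  sample_lines <- readLines(path, n = n, warn = FALSE)
  # fixed = TRUE treats each candidate as a literal character, not a regex
  present <- vapply(
    candidates,
    function(ch) any(grepl(ch, sample_lines, fixed = TRUE)),
    logical(1)
  )
  candidates[!present]
}

# Small demonstration on a temporary file with made-up contents
tmp <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkk", "a=b c#d"), tmp)
free <- unused_seps(tmp)
print(free)
```

Any character returned by the helper is a candidate to pass as sep to fread(); if reading still fails, the character probably occurs outside the sampled lines.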
Note: As @Cath mentioned in the comments, you can also simply use the newline character \n as the sep value.
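As a minimal sketch of the newline variant, the snippet below writes two made-up lines in the shape described in the question to a temporary file and reads them back with sep = "\n". The file contents and the colClasses = "character" argument are assumptions added for illustration and are not part of the original answer. The behaviour may also depend on the installed data.table version.

```r
library(data.table)

# Made-up sample lines in the shape described in the question
lines_in <- c(
  "aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
  "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk"
)
tmp <- tempfile(fileext = ".DAT")
writeLines(lines_in, tmp)

# With a newline as sep, nothing inside a line is treated as a column
# separator, so each line becomes one value of a single column.
# header = FALSE keeps the first line as data; colClasses = "character"
# is an extra precaution against lines being converted to another type.
x <- fread(tmp, sep = "\n", header = FALSE, colClasses = "character")[[1L]]

print(x)
```

fread() still applies its own defaults here, such as stripping leading and trailing whitespace (strip.white = TRUE), so on some files the result may not match readLines() character for character.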