How can I use fread() like readLines(), without automatic column detection?

Eri*_*ang 12 r data.table

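I have a 5 GB .dat file (more than 10 million lines). Each line has a format like aaaa bb cccc0123 xxx kkkkkkkkkkkkkk or aaaaabbbcccc01234xxxkkkkkkkkkkkkkk, for example. Because readLines() performs poorly on large files, I tried to read the file with fread() instead, but got an error: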

library("data.table")
x <- fread("test.DAT")
Error in fread("test.DAT") : 
  Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.DAT") :
  Unable to find 5 lines with expected number of columns (+ middle)

How can I use fread() like readLines(), with no automatic column detection? Or is there another way to solve this problem?

Ric*_*ven 22

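Here is a trick. You can pass a sep value that you know does not occur in the file. Doing so forces fread() to read each whole line as a single column. We can then drop that column down to an atomic vector (done with [[1L]] below). Here is an example on a csv file where I use ? as the sep. Used this way, it behaves like readLines(), only much faster.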

f <- fread("Batting.csv", sep= "?", header = FALSE)[[1L]]
head(f)
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"       
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"  
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,," 
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"

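Other uncommon characters you could try for sep are \ ^ @ # = and others. We can see that this produces the same output as readLines(). It is just a matter of finding a sep value that is not present in the file.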

head(readLines("Batting.csv"))
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"                                  
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"                             
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"                            
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"                           
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,," 
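Finding a sep value that is absent from the file can itself be scripted. The following is only a minimal sketch: unused_seps() is a hypothetical helper (it is not part of data.table), the candidate list is arbitrary, and the temporary file stands in for the real data. It only inspects the first n lines, so on a very large file a character it reports as unused could still appear further down.

```r
library(data.table)

# Hypothetical helper (not a data.table function): return the candidate
# single-character separators that do not occur in the first `n` lines.
unused_seps <- function(path,
                        candidates = c("?", "^", "@", "#", "=", "~"),
                        n = 10000L) {
  sample_lines <- readLines(path, n = n)
  # fixed = TRUE so that characters such as "?" and "^" are matched literally
  present <- vapply(
    candidates,
    function(ch) any(grepl(ch, sample_lines, fixed = TRUE)),
    logical(1)
  )
  candidates[!present]
}

# Small stand-in file; the third line is made up so that "=" and "#" occur.
tmp <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk",
             "id=42 # note"), tmp)

free_seps <- unused_seps(tmp)

# Read every line as one column using the first unused candidate.
lines_fread <- fread(tmp, sep = free_seps[1L], header = FALSE)[[1L]]
```

If free_seps comes back empty, none of the candidates is safe for the sampled lines and a different candidate list is needed.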

Note: As @Cath mentioned in the comments, you can also simply use the newline character \n as the sep value.
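As a rough illustration of the newline variant, the sketch below writes two lines in the question's format to a temporary file (the file is an assumption made so the example can run anywhere) and compares the result with readLines().

```r
library(data.table)

# Stand-in for the real .dat file, using the line formats from the question.
tmp_nl <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk"), tmp_nl)

# sep = "\n" means no column separator: each line becomes one field.
lines_nl <- fread(tmp_nl, sep = "\n", header = FALSE)[[1L]]
lines_base <- readLines(tmp_nl)

# TRUE when both approaches return the same character vector
same_result <- identical(lines_nl, lines_base)
```

On real data the two results may differ in small ways. For instance, fread() strips leading and trailing whitespace by default (strip.white = TRUE) and may read a file made up entirely of digits as numbers, so arguments such as strip.white = FALSE or colClasses = "character" may be worth adding depending on the file.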

  • This deserves a lot more upvotes. Nice trick; in my case I actually used sep = '~'. (2 upvotes)