How to read a text file into R when the data is not in a table

Far*_*rel 5 r

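I have a very long phone log as a text file that I am trying to read into R, but it really isn't working. The text has a structure, but it is definitely not a table. The structure is as follows:

  1. Each record spans multiple lines, so readLines is not a good fit
  2. Each line of a record is a separate field
  3. Some records have an additional field after the second field
  4. Each new record is marked by a blank line. readLines or scan would work if I could specify that records are separated by "\n \n" and fields (or columns) by "\n"

Here is an example: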

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 12:56
  blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 12:58
  blay blay blah who knows what

TheInstitute 5467
  telephone line 412552999 x 4999
  bump phone line 4125527777
  datetime 2011110516 12:59
  blay blay blah who knows what

TheInstitute 5467
  telephone line 4125526987 x 4567
  bump phone line 4125527777
  datetime 2011110516 13:51
  blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 14:56
  blay blay blah who knows what

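How can I do this in R? I have tried tricks with scan, paste, and strsplit, but I am spinning my wheels. I probably have to put the data into a list, since a list can handle unequal numbers of elements. I would like all records to have the same number of fields, and for records that lack a field (here, the bump phone line) I would like that field to simply hold NA. I would appreciate help even just getting started. From there I can play and toy around with it.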

42-*_*42- 15

If multi.line = TRUE in the scan function, records should end with two end-of-lines. I did this with a textConnection around your text, but you would use a valid file name instead:

inp <- scan(textConnection(txt), multi.line=TRUE, 
             what=list(place="character", tline1="character", 
             cline1="character", cline2 ="character", cline3="character"), sep="\n")
Read 5 records
> str(as.data.frame(inp))
'data.frame':   5 obs. of  5 variables:
 $ place : Factor w/ 1 level "TheInstitute 5467": 1 1 1 1 1
 $ tline1: Factor w/ 2 levels "  telephone line 4125526987 x 4567",..: 1 1 2 1 1
 $ cline1: Factor w/ 4 levels "  bump phone line 4125527777",..: 2 3 1 1 4
 $ cline2: Factor w/ 4 levels "  blay blay blah who knows what",..: 2 1 3 4 1
 $ cline3: Factor w/ 3 levels "","  blay blay blah who knows what",..: 1 1 2 3 1
> as.data.frame(inp)
              place                             tline1
1 TheInstitute 5467   telephone line 4125526987 x 4567
2 TheInstitute 5467   telephone line 4125526987 x 4567
3 TheInstitute 5467    telephone line 412552999 x 4999
4 TheInstitute 5467   telephone line 4125526987 x 4567
5 TheInstitute 5467   telephone line 4125526987 x 4567
                        cline1
1    datetime 2011110516 12:56
2    datetime 2011110516 12:58
3   bump phone line 4125527777
4   bump phone line 4125527777
5    datetime 2011110516 14:56
                                                           cline2
1   blay blay blah who knows what, but anyway it may have a comma
2                                   blay blay blah who knows what
3                                       datetime 2011110516 12:59
4                                       datetime 2011110516 13:51
5                                   blay blay blah who knows what
                                                           cline3
1                                                                
2                                                                
3                                   blay blay blah who knows what
4   blay blay blah who knows what, but anyway it may have a comma
5                                                                
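
Note that in the result above the optional bump phone line lands in cline1 and pushes the datetime and free-text fields one column to the right, so the columns are not yet aligned the way the question asks. The following is a minimal sketch of one way to realign them and fill in NA. It assumes the optional line always starts with "bump phone line"; `df` is a hand-built two-row stand-in for as.data.frame(inp), and the names `df`, `has_bump`, and `aligned` are illustrative only.

```r
# Toy stand-in for as.data.frame(inp): two of the five records shown above.
df <- data.frame(
  place  = c("TheInstitute 5467", "TheInstitute 5467"),
  tline1 = c("  telephone line 4125526987 x 4567",
             "  telephone line 412552999 x 4999"),
  cline1 = c("  datetime 2011110516 12:56",
             "  bump phone line 4125527777"),
  cline2 = c("  blay blay blah who knows what, but anyway it may have a comma",
             "  datetime 2011110516 12:59"),
  cline3 = c("", "  blay blay blah who knows what"),
  stringsAsFactors = FALSE
)

# TRUE for records where the optional bump phone line is present
# (assumption: that line always begins with this literal text).
has_bump <- grepl("^[[:space:]]*bump phone line", df$cline1)

# Shift the fields back into fixed columns; records without a bump line get NA.
aligned <- data.frame(
  place     = df$place,
  telephone = trimws(df$tline1),
  bump      = ifelse(has_bump, trimws(df$cline1), NA_character_),
  datetime  = trimws(ifelse(has_bump, df$cline2, df$cline1)),
  note      = trimws(ifelse(has_bump, df$cline3, df$cline2)),
  stringsAsFactors = FALSE
)

print(aligned)
```

With the real data, `df` would be replaced by as.data.frame(inp, stringsAsFactors = FALSE), so that the columns are character vectors rather than factors.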
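
For comparison, here is a rough sketch of a different route that does not rely on scan: read the file line by line, group the lines into records at each blank line, and pad records that lack the optional field with NA. This is not part of the approach shown above. The sample `lines` vector stands in for readLines() on the real file (the file name in the comment is hypothetical), and `parse_record`, `records`, and `log_df` are illustrative names. It assumes every record has either four or five lines and that the optional third line starts with "bump phone line".

```r
# Sample input; with a real file this would be something like
# lines <- readLines("phone_log.txt")   # hypothetical file name
lines <- c(
  "TheInstitute 5467",
  "  telephone line 4125526987 x 4567",
  "  datetime 2011110516 12:56",
  "  blay blay blah who knows what, but anyway it may have a comma",
  "",
  "TheInstitute 5467",
  "  telephone line 412552999 x 4999",
  "  bump phone line 4125527777",
  "  datetime 2011110516 12:59",
  "  blay blay blah who knows what"
)

# Blank (or whitespace-only) lines separate records.
blank <- !nzchar(trimws(lines))

# The number of blank lines seen so far identifies the record of each line.
record_id <- cumsum(blank)[!blank]
records <- split(trimws(lines[!blank]), record_id)

# Pad a record to five fields, inserting NA where the bump line is missing.
parse_record <- function(rec) {
  has_bump <- grepl("^bump phone line", rec[3])
  if (!has_bump) rec <- append(rec, NA_character_, after = 2)
  setNames(rec, c("place", "telephone", "bump", "datetime", "note"))
}

# Stack the padded records into one character data frame.
log_df <- as.data.frame(do.call(rbind, lapply(records, parse_record)),
                        stringsAsFactors = FALSE)

print(log_df)
```

Detecting the optional field by its text rather than by record length keeps the datetime and free-text lines in their own columns even when a record is padded.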