An algorithm to filter a text file

Question

Imagine you have a .txt file with the following structure:

>>> header
>>> header
>>> header
K L M
200 0.1 1
201 0.8 1
202 0.01 3
...
800 0.4 2
>>> end of file
50 0.1 1
75 0.78 5
...


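I would like to read all the data except the lines marked with >>> and the lines below the >>> end of file line. So far I have solved this with read.table(comment.char = ">", skip = x, nrow = y), where x and y are currently fixed. This reads the data between the header and >>> end of file.

However, I would like to make my function more flexible about the number of rows. The data may contain values larger than 800, and consequently more rows.

I could scan or readLines the file, find which line corresponds to >>> end of file, and calculate the number of lines to read. What approach would you use?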

Answer 1

Rei*_*son 11

Here is one way to do it:

Lines <- readLines("foo.txt")
markers <- grepl(">", Lines)
want <- rle(markers)$lengths[1:2]
want <- seq.int(want[1] + 1, sum(want), by = 1)
read.table(textConnection(Lines[want]), sep = " ", header = TRUE)


This gives:

> read.table(textConnection(Lines[want]), sep = " ", header = TRUE)
    K    L M
1 200 0.10 1
2 201 0.80 1
3 202 0.01 3
4 800 0.40 2


on the snippet of data you provided (saved in the file foo.txt, after removing the ... lines).
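To make the rle() step easier to follow, here is a small sketch that runs the same logic on an in-memory copy of the sample lines and inspects the intermediate values. The sample vector and the names runs and first_two are illustrative assumptions, not part of the original answer. The approach relies on the file starting with a run of marker lines and on no data line containing ">".

```r
# Sample lines mirroring the question's file (with the "..." rows removed)
Lines <- c(">>> header", ">>> header", ">>> header",
           "K L M", "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
           ">>> end of file", "50 0.1 1", "75 0.78 5")

markers <- grepl(">", Lines)   # TRUE for every line that contains ">"
runs <- rle(markers)           # consecutive runs of TRUE / FALSE
runs$lengths                   # run lengths: header block, data block, end marker, trailing rows

# The first run is the header block, the second run is the wanted data block
first_two <- runs$lengths[1:2]
want <- seq.int(first_two[1] + 1, sum(first_two), by = 1)

con <- textConnection(Lines[want])
dat <- read.table(con, sep = " ", header = TRUE)
close(con)                     # textConnection() is created open, so close it explicitly
```

Because only the first two runs are used, anything after the first block of data, including the end marker and the rows below it, is ignored regardless of how many data rows the file has.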

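A small side effect of using textConnection() inside a function (for example with lapply) is that the connection gets gc()-ed, which produces an irritating but harmless warning. This can be avoided by calling closeAllConnections() after the textConnection() call.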
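Since closeAllConnections() closes every open user connection, a narrower option is to close only the connection that was created. The sketch below shows two ways to do that; the helper name read_block and the sample blocks are hypothetical and used only for illustration.

```r
# Hypothetical helper: parse one block of lines and close its own connection
read_block <- function(lines) {
  con <- textConnection(lines)
  on.exit(close(con))          # runs when the function returns, even on error
  read.table(con, header = TRUE)
}

# Illustrative input: two small blocks of lines
blocks <- list(c("K L M", "200 0.1 1"),
               c("K L M", "201 0.8 1", "202 0.01 3"))
res <- lapply(blocks, read_block)

# Alternative: the `text` argument of read.table() creates and closes
# the text connection internally
res2 <- lapply(blocks, function(x) read.table(text = x, header = TRUE))
```

Either form should avoid the warning about unused connections being closed, because no connection is left open for the garbage collector to clean up.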

Answer 2

G. *_*eck 11

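Here are a few ways.

1) readLines. Read the lines of the file into L, then set skip to the number of lines to skip at the beginning and end.of.file to the line number of the row that marks the end of the data. The read.table command then uses these two variables to re-read the data.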

File <- "foo.txt"

L <- readLines(File)
skip <- grep("^.{0,2}[^>]", L)[1] - 1
end.of.file <- grep("^>>> end of file", L)

read.table(File, header = TRUE, skip = skip, nrow = end.of.file - skip - 2)


A variation is to use textConnection in place of File in the read.table line:

read.table(textConnection(L), header = TRUE, 
   skip = skip, nrow = end.of.file - skip - 2)

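The readLines approach above can be wrapped in a function so that the number of rows no longer has to be fixed. This is a minimal sketch, assuming a single column-name row directly above the data and one end marker line; the function name read_until_marker, its end_marker parameter, and the temporary sample file are assumptions made for illustration.

```r
# Hypothetical wrapper around the readLines + read.table approach
read_until_marker <- function(file, end_marker = "^>>> end of file") {
  L <- readLines(file)
  # First line with a character other than ">" among its first three
  # characters, i.e. the first line that does not start with ">>>"
  skip <- grep("^.{0,2}[^>]", L)[1] - 1
  end.of.file <- grep(end_marker, L)[1]
  if (is.na(end.of.file)) stop("end marker not found in ", file)
  # Subtract 2: one for the column-name row, one for the marker line itself
  read.table(file, header = TRUE, skip = skip,
             nrows = end.of.file - skip - 2)
}

# Usage on a temporary copy of the sample data (with the "..." rows removed)
tmp <- tempfile(fileext = ".txt")
writeLines(c(">>> header", ">>> header", ">>> header",
             "K L M", "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
             ">>> end of file", "50 0.1 1", "75 0.78 5"), tmp)
dat <- read_until_marker(tmp)
```

The file is read twice, once by readLines and once by read.table, which is simple but may be slow for very large files; the textConnection variation avoids the second read from disk.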

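2) Another possibility is to use sed or awk/gawk. Consider this one-line gawk program. The program exits if it sees the line marking the end of the data; otherwise, it skips the current line if that line starts with >>>; if neither applies, it prints the line. We can pipe foo.txt through the gawk program and read the result with read.table.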

cat("/^>>> end of file/ { exit }; /^>>>/ { next }; 1\n", file = "foo.awk")
read.table(pipe('gawk -f foo.awk foo.txt'), header = TRUE)


A variation of this is to omit the /^>>>/ { next }; portion of the gawk program, which skips the lines starting with >>>, and use comment.char = ">" in the read.table call instead.
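For reference, the shortened gawk variation might look like the sketch below. It assumes gawk is installed and on the PATH, and it does nothing otherwise. The temporary file names and the variables data_file, awk_file, and cmd are illustrative assumptions.

```r
dat <- NULL
if (nzchar(Sys.which("gawk"))) {   # only run when gawk is available
  data_file <- tempfile(fileext = ".txt")
  awk_file <- tempfile(fileext = ".awk")
  writeLines(c(">>> header", ">>> header", ">>> header",
               "K L M", "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
               ">>> end of file", "50 0.1 1", "75 0.78 5"), data_file)

  # gawk stops at the end marker and prints every line before it
  cat("/^>>> end of file/ { exit }; 1\n", file = awk_file)

  # read.table drops the ">>> header" lines because ">" is the comment character
  cmd <- paste("gawk -f", shQuote(awk_file), shQuote(data_file))
  dat <- read.table(pipe(cmd), header = TRUE, comment.char = ">")
}
```

Here gawk is responsible only for stopping at the end marker, while read.table handles the header lines, so the awk program stays as short as possible.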
