我有一个包含 40,000 多行的大型数据文件。它是日志输入的列表,看起来有点像这样:
D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0
D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" ""
M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction
M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647
F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"
M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s
Run Code Online (Sandbox Code Playgroud)
由于它太大了,我不想将整个内容读入内存。我只需要以行标识符“F”开头并且有 (0, 0) 错误的行,如下所示:
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)"
Run Code Online (Sandbox Code Playgroud)
其他一切我都可以忽略。我的问题是这样的:我想要一种方法来逐行读取该文件并评估它是否需要保留该行以供导入。目前,我正在使用for循环来运行每一行并使用该readLines()函数。它看起来像这样:
library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines(con))
for (i in 1:lineLength){
line <- readLines("dataSet.txt", 1)
if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)[i]){
print(line)
Fdata[j,] <- rbind(line)
i <- i + 1
j <- j + 1
}
i <- i + 1
}
print(Fdata)
Run Code Online (Sandbox Code Playgroud)
它运行良好,但它给我的输出不是我想要的。它只是不断地一遍又一遍地打印文件的第一行。
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
Run Code Online (Sandbox Code Playgroud)
我怎样才能让它评估我是否需要该行,以及如何正确存储它(作为向量、数据帧、矩阵,这并不重要),以便我可以在 for 循环之外打印它?
更新
我已将我的代码更改为:
library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines(con))
for (i in 1:lineLength){
line <- readLines(con, 1)
print(line)
if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)[i]){
print(line)
Fdata[j,] <- rbind(line)
i <- i + 1
j <- j + 1
}
i <- i + 1
}
print(Fdata)
Run Code Online (Sandbox Code Playgroud)
但是,当我检查存储在行中的值时,它说它是空的。我不明白为什么它改变了。此外,它告诉我 if 语句没有正确的 TRUE/FALSE 条件,这也让我感到困惑,因为 grepl() 应该返回 TRUE/FALSE 值。
更新
我设法消除了错误,但当我调用 Fdata 时仍然没有得到任何信息。我检查了我的变量,R 说该行是空的,没有字符。我分配错了吗?我希望 line 成为我在数据文件中解析并评估是否需要存储它的行。这是我更新的代码:
library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines("dataSet.txt))
for (i in 1:lineLength){
line <- readLines(con, 1)
print(line)
if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)){
print(line)
Fdata[j,] <- rbind(line)
i <- i + 1
j <- j + 1
}
i <- i + 1
}
print(Fdata)
Run Code Online (Sandbox Code Playgroud)
看一下这个:
con <- file("test1.txt", "r")
lines <- c()
while(TRUE) {
line = readLines(con, 1)
if(length(line) == 0) break
else if(grepl("^\\s*F{1}", line) && grepl("(0,0)", line, fixed = TRUE)) lines <- c(lines, line)
}
lines
# [1] "F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz\" \"\" 3322771022 (0,0) \"1499.61 seconds (17.7 megabits/sec)\""
Run Code Online (Sandbox Code Playgroud)
将文件流传递给它,readLines以便它可以逐行读取它。使用正则表达式^\\s*F{1}捕获以字母开头F且可能有空格的行,其中^空格表示字符串的开头。用于fixed=T捕获 的精确匹配(0,0)。如果两项检查均为TRUE,则将结果附加到行中。
数据:
D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0
D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" ""
M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction
M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647
F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"
M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)"
Run Code Online (Sandbox Code Playgroud)