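What is the fastest way to parse a text file such as the example below into a two-column data.frame (key, value) and then convert it to wide format?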
FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Panseri, Sara
   Chiesa, Luca Maria
   Brizzolari, Andrea
   Santaniello, Enzo
   Passero, Elena
   Biondi, Pier Antonio
TI Improved determination of malonaldehyde by high-performance liquid
   chromatography with UV detection as 2,3-diaminonaphthalene derivative
SO JOURNAL OF CHROMATOGRAPHY B-ANALYTICAL TECHNOLOGIES IN THE BIOMEDICAL
   AND LIFE SCIENCES
VL 976
BP 91
EP 95
DI 10.1016/j.jchromb.2014.11.017
PD JAN 22 2015
PY 2015
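Using readLines is problematic because the continuation lines of multi-line fields have no key. Reading the file as a fixed-width table doesn't work either. Any suggestions? If it weren't for the multi-line issue, this would be easy to do with a function that operates on each line/record, like this: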
x <- "FN Thomson Reuters Web of Science"
re <- "^([^\\s]+)\\s*(.*)$"
key <- sub(re, "\\1", x, perl=TRUE)
value <- sub(re, "\\2", x, perl=TRUE)
data.frame(key, value)
  key                          value
1  FN Thomson Reuters Web of Science
Note: the field keys are always uppercase and two characters long. The whole title and the whole author list can each be concatenated into a single cell.
This should work:
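library(zoo)  # provides na.locf()

# Read each line as a 2-character key followed by the rest of the line
x <- read.fwf(file = "tempSO.txt", widths = c(2, 500), as.is = TRUE)

# Continuation lines have a blank key: mark it NA, then carry the last key forward
x$V1[trimws(x$V1) == ""] <- NA
x$V1 <- na.locf(x$V1)

# Concatenate all values that share a key (each V2 keeps its leading space)
res <- aggregate(V2 ~ V1, data = x, FUN = paste, collapse = "")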
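For comparison, here is a minimal base-R sketch that avoids the zoo dependency and also covers the wide-format step from the question. It rests on several assumptions that are not confirmed by the post: the file holds a single record, the first line starts with a key, every key line begins with two uppercase characters (the second may be a digit), and continuation lines begin with whitespace. The names `is_key`, `field_id`, `long` and `wide` are made up for illustration, and the sample lines are inlined only so that the snippet runs on its own.

```r
# Sample lines inlined for a self-contained run; in practice something like
# lines <- readLines("tempSO.txt") would replace this vector.
lines <- c(
  "FN Thomson Reuters Web of Science",
  "VR 1.0",
  "PT J",
  "AU Panseri, Sara",
  "   Chiesa, Luca Maria",
  "TI Improved determination of malonaldehyde by high-performance liquid",
  "   chromatography with UV detection as 2,3-diaminonaphthalene derivative",
  "PY 2015"
)

# Assumption: a line opens a new field when it starts with a two-character key
is_key <- grepl("^[A-Z][A-Z0-9]( |$)", lines)

# cumsum() gives continuation lines the same id as the key line above them
field_id <- cumsum(is_key)

# Drop the key from key lines, trim padding, then join the pieces of each field
keys <- substr(lines[is_key], 1, 2)
text <- trimws(ifelse(is_key, substring(lines, 3), lines))
values <- vapply(split(text, field_id), paste, character(1), collapse = " ")

# Long format: one row per field, in file order
long <- data.frame(key = keys, value = unname(values), stringsAsFactors = FALSE)

# Wide format: one column per key, a single row for the single record
wide <- as.data.frame(as.list(setNames(long$value, long$key)),
                      stringsAsFactors = FALSE)
```

Unlike aggregate(), which returns the keys sorted alphabetically, this keeps the fields in the order they appear in the file. Joining with a space suits wrapped titles; for the author list a separator such as "; " may be preferable, since the names themselves contain commas.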