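I am trying to convert data from a large number of PDF files into data frames in R. I have been using read.fwf() on .txt versions of the PDF files, but the problem is that the column widths are not the same from one .txt file to the next. Is there a way to determine the column widths, or is there a function other than read.fwf() that I could use?
I have a large number of files to convert, and they all have different formats, so finding the specific column widths for each file is becoming very tedious. Is there a more efficient way to convert data from PDF files into a data frame in R?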
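Here is one possible solution using regular expressions. You can use the readPDF function from the tm package to convert a PDF file to text, with each line of the PDF as one text string. Then use regular expressions to split the data into the appropriate column fields so it can be converted to a data frame.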
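To make the regex step concrete without needing a PDF, here is a minimal sketch that runs on a few in-memory lines. The raw lines and their spacing are made up for illustration (they only imitate layout-preserving text extraction), and the names `lines`, `rows`, and `parsed` are placeholders.

```r
# Hypothetical raw lines; the spacing is an assumption, not copied from a real file
lines <- c(
  "  1  ACE Limited   FIN   0A4848AC9   ACE-INAHldgs 8.875 15Aug29   00440EAC1   0.008",
  "  2  Aetna Inc.   FIN   0A8985AC5   AET 6.625 15Jun36 BondCall   00817YAF5   0.008",
  "Page 1 of 5"
)

# Keep only the lines that start with a row number
rows <- lines[grepl("^ {0,2}[0-9]{1,3}", lines)]

# Put a pipe after the leading row number, dropping the surrounding spaces
rows <- sub("^ *([0-9]{1,3}) +", "\\1|", rows)

# Treat every run of two or more spaces as a field separator
rows <- gsub(" {2,}", "|", rows)

# Parse the pipe-delimited strings into a data frame
parsed <- read.table(text = rows, sep = "|", quote = "",
                     stringsAsFactors = FALSE)
print(parsed)
```

Single spaces inside a field (such as the company name) survive because only runs of two or more spaces are turned into separators.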
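I've packaged the approach into a function so that you can read and parse all of your PDF files and combine them into a single data frame in one operation. If your other files have formatting quirks that aren't present in the file you posted, you'll need to make some adjustments to get it working.
The code also checks for some simple data format problems and saves "bad" rows to a separate text file for later inspection and processing. Again, you may need to adjust this if your other files have different formatting variations.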
# Use text-mining package to extract text from PDF files
library(tm)

# Function to read a PDF file and turn it into a data frame
PDFtoDF = function(file) {
  ## Extract PDF text. Each line of PDF becomes one element of the string vector dat.
  dat = readPDF(control=list(text="-layout"))(elem=list(uri=file),
                                              language="en", id="id1")
  dat = c(as.character(dat))

  ## Keep only those strings that contain the data we want.
  ## These are the ones that begin with a number.
  dat = dat[grep("^ {0,2}[0-9]{1,3}", dat)]

  ## Create separators so we can turn strings into a data frame. We'll use the
  ## pipe "|" as a separator.

  # Add pipe after first number (the row number in the PDF file)
  dat = gsub("^ ?([0-9]{1,3}) ?", "\\1|", dat)

  # Replace each instance of 2 or more spaces in a row with a pipe separator. This
  # works because the company names have a single space between words, while data
  # fields generally have more than one space between them.
  # (We just need to first add an extra space in a few cases where there's only one
  # space between two data fields.)
  dat = gsub("(, HVOL )","\\1 ", dat)
  dat = gsub(" {2,100}", "|", dat)

  ## Check for data format problems
  # Identify rows without the right number of fields (there should
  # be six pipe characters per row) and save them to a file for
  # later inspection and processing (in this case row 11 of the PDF file is excluded)
  excludeRows = vapply(gregexpr("\\|", dat), length, integer(1)) != 6
  write(dat[excludeRows], "rowsToCheck.txt", append=TRUE)

  # Remove the excluded rows from the string vector
  dat = dat[!excludeRows]

  ## Convert string vector to data frame
  dat = read.table(text=dat, sep="|", quote="", stringsAsFactors=FALSE)
  names(dat) = c("RowNum", "Reference Entity", "Sub-Index", "CLIP",
                 "Reference Obligation", "CUSIP/ISIN", "Weighting")
  return(dat)
}
# Create vector of names of files to read
files = list.files(pattern="CDX.*\\.pdf")
# Read each file, convert it to a data frame, then rbind into single data frame
df = do.call("rbind", lapply(files, PDFtoDF))
# Sample of data frame output from your sample file
df
  RowNum Reference Entity  Sub-Index      CLIP       Reference Obligation CUSIP/ISIN Weighting
1      1      ACE Limited        FIN 0A4848AC9 ACE-INAHldgs 8.875 15Aug29  00440EAC1     0.008
2      2       Aetna Inc.        FIN 0A8985AC5 AET 6.625 15Jun36 BondCall  00817YAF5     0.008
3      3       Alcoa Inc. INDU, HVOL 014B98AD5            AA 5.72 23Feb19  013817AP6     0.008
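If you would rather stay with read.fwf(), the column widths can sometimes be estimated from the text itself. The sketch below is only a heuristic, not part of the function above: it assumes the columns are separated by at least two character positions that are blank on every line, and it uses made-up sample lines. The names `txt`, `min_gap`, `widths`, and `fw` are placeholders.

```r
# Hypothetical fixed-width lines, built with sprintf so the layout is explicit
txt <- sprintf("%-14s%-12s%s",
               c("ACE Limited", "Aetna Inc.", "Alcoa Inc."),
               c("FIN", "FIN", "INDU, HVOL"),
               "0.008")

# Pad every line to a common length and split into a character matrix
padded <- format(txt)
n <- nchar(padded[1])
chars <- do.call(rbind, strsplit(padded, ""))

# A position is a gap candidate if it is blank on every line
gap <- colSums(chars != " ") == 0

# Only runs of at least min_gap blank positions count as column separators,
# so a single space inside a field (e.g. "INDU, HVOL") does not split it
min_gap <- 2
r <- rle(gap)
ends <- cumsum(r$lengths)
is_sep <- r$values & r$lengths >= min_gap

# A column starts at position 1 and right after each separator run
col_starts <- c(1, ends[is_sep] + 1)
col_starts <- col_starts[col_starts <= n]
widths <- diff(c(col_starts, n + 1))

# Read the lines with the estimated widths
fw <- read.fwf(textConnection(padded), widths = widths, strip.white = TRUE,
               quote = "", comment.char = "", stringsAsFactors = FALSE)
print(widths)
print(fw)
```

This breaks down when two adjacent fields are separated by a single space on every line, or when a field contains a wide run of blanks on all lines, so it is worth checking the estimated widths by eye for each file format.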