blz*_*blz 11 python pdf-parsing pypdf pdf-manipulation
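I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but for a subset of them I get this somewhat cryptic error:
IPython stack trace: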
/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
331 break
332 else:
--> 333 raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
334 if self.catalog.get('Type') is not LITERAL_CATALOG:
335 if STRICT:
PDFSyntaxError: No /Root object! - Is this really a PDF?
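Naturally, I immediately checked whether these PDFs were corrupted, but they open and read just fine.
Is there any way to read these PDFs despite the missing root object? I'm not really sure where to go from here.
Many thanks!
Edit:
I tried PyPDF in an attempt to get some differential diagnostics. The stack trace is below: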
In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
372 self.flattenedPages = None
373 self.resolvedObjects = {}
--> 374 self.read(stream)
375 self.stream = stream
376 self._override_encryption = False
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
708 line = self.readNextEndLine(stream)
709 if line[:5] != "%%EOF":
--> 710 raise utils.PdfReadError, "EOF marker not found"
711
712 # find startxref entry - the location of the xref table
PdfReadError: EOF marker not found
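Quonux suggested that PDFMiner may stop parsing after reaching the first EOF character. This trace would seem to suggest otherwise, but I'm very much clueless. Any thoughts?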
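Both errors concern structural markers near the end of the file (the `%%EOF` marker, `startxref`, and the trailer's `/Root` entry), so a quick way to compare a failing file with a working one is to look for those markers in the raw bytes. The following is only a minimal sketch using the standard library, written for Python 3; `inspect_pdf_bytes` and `inspect_pdf_file` are hypothetical helper names, the 1024-byte header window is an assumption about what lenient readers tolerate, and a plain byte search is a rough heuristic rather than a real parse.

```python
import re


def inspect_pdf_bytes(data):
    """Report basic structural markers found in raw PDF bytes (heuristic only)."""
    eof_positions = [m.start() for m in re.finditer(b"%%EOF", data)]
    last_eof = eof_positions[-1] if eof_positions else None
    # Count non-whitespace bytes that follow the last %%EOF marker.
    trailing = None
    if last_eof is not None:
        trailing = len(data[last_eof + len(b"%%EOF"):].strip())
    return {
        # Assumption: accept a header anywhere in the first 1024 bytes.
        "has_header": b"%PDF-" in data[:1024],
        "eof_count": len(eof_positions),
        "trailing_bytes_after_last_eof": trailing,
        "has_startxref": b"startxref" in data,
        "has_root_key": b"/Root" in data,
    }


def inspect_pdf_file(path):
    """Read the file in binary mode and inspect its bytes."""
    with open(path, "rb") as fh:
        return inspect_pdf_bytes(fh.read())
```

If `trailing_bytes_after_last_eof` is large, or `eof_count` is zero, that would be consistent with PyPDF's "EOF marker not found"; if `has_root_key` is false, that would be consistent with PDFMiner's message. Neither result proves the cause on its own.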
Interesting problem. I did some research:
The function that parses the PDF (from the PDFMiner source code):
def set_parser(self, parser):
    "Set the document to use a given PDFParser object."
    if self._parser: return
    self._parser = parser
    # Retrieve the information of each header that was appended
    # (maybe multiple times) at the end of the document.
    self.xrefs = parser.read_xref()
    for xref in self.xrefs:
        trailer = xref.get_trailer()
        if not trailer: continue
        # If there's an encryption info, remember it.
        if 'Encrypt' in trailer:
            #assert not self.encryption
            self.encryption = (list_value(trailer['ID']),
                               dict_value(trailer['Encrypt']))
        if 'Info' in trailer:
            self.info.append(dict_value(trailer['Info']))
        if 'Root' in trailer:
            # Every PDF file must have exactly one /Root dictionary.
            self.catalog = dict_value(trailer['Root'])
            break
    else:
        raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    if self.catalog.get('Type') is not LITERAL_CATALOG:
        if STRICT:
            raise PDFSyntaxError('Catalog not found!')
    return
If you had an EOF problem, a different exception would be raised. Another function from the source:
def load(self, parser, debug=0):
    while 1:
        try:
            (pos, line) = parser.nextline()
            if not line.strip(): continue
        except PSEOF:
            raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
        if not line:
            raise PDFNoValidXRef('Premature eof: %r' % parser)
        if line.startswith('trailer'):
            parser.seek(pos)
            break
        f = line.strip().split(' ')
        if len(f) != 2:
            raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
        try:
            (start, nobjs) = map(long, f)
        except ValueError:
            raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
        for objid in xrange(start, start+nobjs):
            try:
                (_, line) = parser.nextline()
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            f = line.strip().split(' ')
            if len(f) != 3:
                raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
            (pos, genno, use) = f
            if use != 'n': continue
            self.offsets[objid] = (int(genno), long(pos))
    if 1 <= debug:
        print >>sys.stderr, 'xref objects:', self.offsets
    self.load_trailer(parser)
    return
From the wiki (PDF specification): a PDF file consists primarily of objects, of which there are eight types:
- Boolean values, representing true or false
- Numbers
- Strings
- Names
- Arrays, ordered collections of objects
- Dictionaries, collections of objects indexed by Names
- Streams, usually containing large amounts of data
- The null object
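Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref (cross-reference) table gives the byte offset of each indirect object from the start of the file. This design allows efficient random access to the objects in the file, and also allows small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that contain large numbers of small indirect objects and is especially useful for Tagged PDF.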
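To make the xref table concrete, here is a simplified sketch of the same logic as the `load` function quoted above, reduced to plain text lines so it runs without PDFMiner. `parse_xref_section` is a hypothetical name, the input is assumed to be the lines that follow the `xref` keyword, and error handling is left out on purpose.

```python
def parse_xref_section(lines):
    """Parse the lines after the 'xref' keyword, stopping at 'trailer'.

    Returns {object_number: (generation, byte_offset)} for in-use entries.
    """
    offsets = {}
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line:
            continue
        if line.startswith("trailer"):
            break
        # Subsection header: "<first object number> <entry count>"
        start, count = (int(x) for x in line.split())
        for objid in range(start, start + count):
            # Entry: "<byte offset> <generation> <n|f>"
            pos, genno, use = next(it).split()
            if use == "n":  # 'n' = in use, 'f' = free
                offsets[objid] = (int(genno), int(pos))
    return offsets


xref_lines = [
    "0 3",
    "0000000000 65535 f ",
    "0000000017 00000 n ",
    "0000000081 00000 n ",
    "trailer",
    "<< /Size 3 /Root 1 0 R >>",
]
offsets = parse_xref_section(xref_lines)
```

In this made-up table, object 0 is a free entry and is skipped, while objects 1 and 2 map to byte offsets 17 and 81. The `/Root 1 0 R` entry in the trailer is the reference that `set_parser` looks for.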
I think the problem is that your "damaged PDF" has several "root elements" on the page.
Possible solution:
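You can download the source and add print statements at every place where the xref objects are retrieved and where the parser tries to parse them. That should let you see the full sequence of events leading up to this error.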
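One way to add that kind of tracing without scattering print statements is a small logging decorator. This is only a sketch: `traced` and `find_root` are hypothetical names, and `find_root` is a stand-in that imitates the `/Root` lookup loop from `set_parser`; it is not PDFMiner code.

```python
import functools
import logging

logger = logging.getLogger("pdf_trace")


def traced(func):
    """Log every call to func, plus its return value or exception."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            logger.debug("%s raised %r", func.__name__, exc)
            raise
        logger.debug("%s returned %r", func.__name__, result)
        return result
    return wrapper


@traced
def find_root(trailers):
    """Hypothetical stand-in for the /Root lookup loop in set_parser."""
    for trailer in trailers:
        if "Root" in trailer:
            return trailer["Root"]
    raise ValueError("No /Root object!")
```

Calling `logging.basicConfig(level=logging.DEBUG)` before running makes the trace visible. In a local copy of the library source, the same decorator could be applied to the functions that read the xref tables and trailers.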
P.S.: I think this is some kind of bug in the product.
Anonymous 5
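The solution with slate pdf is to open the file with 'rb' (read binary mode).
Since slate pdf depends on PDFMiner and I had the same problem, this should solve your problem too.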
import slate

# Open in binary mode ('rb'); the raw string keeps the backslashes literal.
with open(r'C:\Users\USER\workspace\slate_minner\document1.pdf', 'rb') as fp:
    doc = slate.PDF(fp)
print(doc)
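One plausible reason binary mode matters: a PDF's xref table stores byte offsets, and text mode may translate line endings, so the data the parser sees no longer lines up with those offsets. The sketch below illustrates the effect on a made-up payload using only the standard library. It is written for Python 3, where the `encoding` argument and universal-newline translation are part of the built-in `open`; the payload is invented for the demonstration and is not a valid PDF.

```python
import os
import tempfile

# Invented payload with Windows-style line endings (5 "\r\n" pairs).
payload = b"%PDF-1.4\r\n1 0 obj\r\n<<>>\r\nendobj\r\n%%EOF\r\n"

fd, path = tempfile.mkstemp(suffix=".pdf")
try:
    with os.fdopen(fd, "wb") as fh:
        fh.write(payload)

    # Binary mode returns the bytes exactly as stored on disk.
    with open(path, "rb") as fh:
        raw = fh.read()

    # Text mode (default newline handling) turns each "\r\n" into "\n".
    with open(path, "r", encoding="latin-1") as fh:
        text = fh.read()
finally:
    os.remove(path)

binary_length = len(raw)
text_length = len(text)
```

Here `text_length` is shorter than `binary_length` by one character per line ending, which is enough to shift every offset after the first line.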