Dan*_*kin 7 python pdf pdf-parsing pdftotext pdfminer
I get the following error when processing a PDF file (2.pdf) with pdfminer (pdf2txt.py):
pdf2txt.py 2.pdf
Traceback (most recent call last):
File "/usr/local/bin/pdf2txt.py", line 115, in <module>
if __name__ == '__main__': sys.exit(main(sys.argv))
File "/usr/local/bin/pdf2txt.py", line 109, in main
interpreter.process_page(page)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
self.init_resources(resources)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
font = self.get_font(None, subspec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
font = PDFCIDFont(self, spec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
StringIO(self.fontfile.get_data()))
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
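A similar file (1.pdf), however, does not cause any problem.
I could not find any information about this error. I opened an issue in the pdfminer GitHub repository, but it is still unanswered. Can someone explain why this happens? What can I do to parse 2.pdf?
Update: after installing pdfminer directly from the GitHub repository, I get a similar error, with BytesIO in place of StringIO.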
$ pdf2txt.py 2.pdf
Traceback (most recent call last):
File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
if __name__ == '__main__': sys.exit(main(sys.argv))
File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
interpreter.process_page(page)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
self.init_resources(resources)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
font = self.get_font(None, subspec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
font = PDFCIDFont(self, spec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
BytesIO(self.fontfile.get_data()))
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
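TL;DR
Thanks to @mkl and @hynecker for the extra information. With it I can confirm that this is a bug in pdfminer and a defect in your PDF. Whenever pdfminer tries to fetch an embedded file stream (for example a font definition), it picks up the last stream in the file that precedes an endobj. Sadly, not all PDFs rigorously add the end tag, so pdfminer should be resilient to this.
Quick fix for this issue
I have created a patch, which has been submitted as a pull request on GitHub. See https://github.com/euske/pdfminer/pull/159.
Detailed diagnosis
As mentioned in the other answers, you see this error because the stream does not supply the number of bytes pdfminer expects when it unpacks the data. But why?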
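The error itself is easy to reproduce in isolation. The following is a minimal sketch, not pdfminer code: it only reuses the format string from the traceback to show that `struct.unpack` needs exactly 16 bytes and raises `struct.error` when a read returns fewer. The helper name `try_unpack` and the sample bytes are made up for illustration, and the sketch is written for Python 3.

```python
import struct
from io import BytesIO

RECORD_FORMAT = ">4sLLL"                      # format string from the traceback
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 4 + 4 + 4 + 4 = 16 bytes


def try_unpack(data):
    """Return the unpacked tuple, or the struct.error message on a short read."""
    fp = BytesIO(data)
    try:
        return struct.unpack(RECORD_FORMAT, fp.read(RECORD_SIZE))
    except struct.error as exc:
        return "struct.error: %s" % exc


full = b"glyf" + struct.pack(">LLL", 1, 2, 3)  # made-up, complete 16-byte record
short = b"glyf\x00\x00"                        # made-up stream that ends early

print(try_unpack(full))
print(try_unpack(short))
```

A read that hits the end of the stream simply returns fewer bytes than requested, so the failure only surfaces one step later, inside `struct.unpack`.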
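As you can see in your stack trace, pdfminer (correctly) works out that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py), and tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.
This does not work for 2.pdf because that stream contains text. You can see this by running dumppdf -b -i 18 2.pdf. Here is how the output starts: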
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...
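To see why text in that stream ends in a `struct.error`, here is a rough sketch of a TrueType-style table-directory reader fed with the first lines of the dump above. It is an illustration, not pdfminer's implementation: the function name `read_table_directory` is hypothetical, the header layout (a 4-byte type tag, four big-endian 16-bit fields, then one 16-byte record per table) is my assumption based on the format strings in the traceback and the usual sfnt layout, and the explicit length check is something I added so the failure is reported clearly.

```python
import struct
from io import BytesIO


def read_table_directory(data):
    """Read a 4-byte type tag, four uint16 header fields, then 16-byte records."""
    fp = BytesIO(data)
    font_type = fp.read(4)
    (ntables, _search, _selector, _shift) = struct.unpack(">HHHH", fp.read(8))
    tables = {}
    records_read = 0
    for _ in range(ntables):
        record = fp.read(16)
        if len(record) < 16:
            # Same situation as in the traceback, reported with more context.
            raise ValueError(
                "table directory truncated: header claims %d tables, "
                "only %d records could be read" % (ntables, records_read))
        (name, _checksum, offset, length) = struct.unpack(">4sLLL", record)
        tables[name] = (offset, length)
        records_read += 1
    return font_type, tables


# Start of stream 18 as dumped above: a CMap program, not a font file.
cmap_text = (b"/CIDInit /ProcSet findresource begin\n"
             b"12 dict begin\n"
             b"begincmap\n")

# The characters "In" (from "/CIDInit") end up being read as the table count.
(ntables,) = struct.unpack(">H", cmap_text[4:6])
print(cmap_text[:4], ntables)

try:
    read_table_directory(cmap_text)
except ValueError as exc:
    print(exc)
```

The text bytes are interpreted as a header announcing thousands of tables, so the reader runs off the end of the stream long before it has read them all.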
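Garbage in, garbage out... So is this a bug in your file or in pdfminer? The fact that other readers can handle the file made me suspicious.
Digging a little further, I saw that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF specification shows that these two cannot be the same.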
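A quick way to spot this kind of duplication is to group the dumped streams by a digest of their contents. The sketch below assumes you have already collected the raw bytes per object ID (for instance by saving the output of the dumppdf command above for each ID); the helper name `find_identical_streams` and the sample data are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_identical_streams(streams):
    """Group object IDs whose stream bytes are byte-for-byte identical.

    `streams` maps an object ID to the raw bytes dumped for it.
    Returns a sorted list of sorted ID lists, one per group of two or more.
    """
    by_digest = defaultdict(list)
    for objid, data in streams.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(objid)
    return sorted(sorted(ids) for ids in by_digest.values() if len(ids) > 1)


# Made-up stand-ins for the dumped streams.
dumped = {
    16: b"some other stream",
    17: b"/CIDInit /ProcSet findresource begin",
    18: b"/CIDInit /ProcSet findresource begin",
}
print(find_identical_streams(dumped))
```

A font file and a ToUnicode cmap landing in the same group is a strong hint that the parser, not the document, is handing out the wrong data.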
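Digging into the code further, I found that all streams were getting the same data. Oops! That is the bug. The cause appears to be that this PDF is missing some end tags, as @hynecker pointed out.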
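The effect of a missing end tag can be mimicked with a toy model. To be clear, this is neither pdfminer's parser nor the patch in the pull request: the byte layout, the regular expression, and both function names are invented purely to illustrate how "take the last stream before the next endobj" hands the wrong data to an object whose own endobj is missing.

```python
import re

# Toy file body: object 17 has no endobj, object 18 does.
RAW = (b"17 0 obj\n<< /Length 4 >>\nstream\nAAAA\nendstream\n"
       b"18 0 obj\n<< /Length 4 >>\nstream\nBBBB\nendstream\nendobj\n")

STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def object_span(raw, objid):
    """Bytes from the object's header up to the next endobj keyword."""
    start = raw.index(b"%d 0 obj" % objid)
    end = raw.index(b"endobj", start)
    return raw[start:end]


def naive_stream(raw, objid):
    """Fragile: return the last stream found before the next endobj."""
    return STREAM.findall(object_span(raw, objid))[-1]


def tolerant_stream(raw, objid):
    """More tolerant: return the first stream after the object's own header."""
    return STREAM.search(object_span(raw, objid)).group(1)


for objid in (17, 18):
    print(objid, naive_stream(RAW, objid), tolerant_stream(RAW, objid))
```

In this toy model the naive lookup returns the data of object 18 for both IDs, while the tolerant lookup returns each object's own stream.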
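The fix is to return the right data for each stream. Any other fix that merely swallows the error would leave the wrong data in use for all streams, resulting in, for example, incorrect font definitions.
I believe the attached patch will fix your problem and should be safe to use in general.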