struct.error: unpack requires a string argument of length 16

Dan*_*kin 7 python pdf pdf-parsing pdftotext pdfminer

I get the following error when processing a PDF file (2.pdf) with pdfminer (pdf2txt.py):

pdf2txt.py 2.pdf 

Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/pdf2txt.py", line 109, in main
    interpreter.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = self.get_font(None, subspec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
    font = PDFCIDFont(self, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
    StringIO(self.fontfile.get_data()))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

A similar file (1.pdf), however, causes no problem.

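I can't find any information about this error. I opened an issue in the pdfminer GitHub repository, but it is still unanswered. Can someone explain why this happens, and what I can do to parse 2.pdf?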


Update: after installing pdfminer directly from the GitHub repository, I get a similar error, this time with BytesIO instead of StringIO.

    $ pdf2txt.py 2.pdf 
Traceback (most recent call last):
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
    self.init_resources(resources)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
    font = self.get_font(None, subspec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = PDFCIDFont(self, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
    BytesIO(self.fontfile.get_data()))
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

Pet*_*ain 5

TL;DR

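Thanks to @mkl and @hynecker for the extra information. With it I can confirm that this is a bug in both pdfminer and your PDF. Whenever pdfminer tries to fetch an embedded file stream (for example, a font definition), it picks up the last one in the file before an endobj. Sadly, not all PDFs rigorously add the end tag, so pdfminer should be resilient to this.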

Quick fix for this issue

I have created a patch, which has been submitted as a pull request on GitHub. See https://github.com/euske/pdfminer/pull/159.

Diagnosis in detail

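As mentioned in the other answers, you see this error because the stream does not contain the number of bytes pdfminer expects when it unpacks the data. But why?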

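As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.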

This doesn't work for 2.pdf because what it gets back is a text stream. You can see this by running dumppdf -b -i 18 2.pdf. Here is the start of what I get:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...

Garbage in, garbage out... So is this a bug in your file or in pdfminer? Well, the fact that other readers can handle the file made me suspicious.
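To make the failure concrete, here is a minimal sketch of what happens when CMap text is read as if it were a TrueType table directory. It is not pdfminer's code: `read_table_directory` is a hypothetical helper, and the 12-byte header layout it assumes (a 4-byte font type followed by four 16-bit fields) is my assumption. Only the `'>4sLLL'` format string and the 16-byte read are taken from the traceback.

```python
import struct
from io import BytesIO

# Same format string as in the traceback: a 4-byte tag plus three
# big-endian 32-bit integers, i.e. exactly 16 bytes per entry.
ENTRY_FORMAT = '>4sLLL'


def read_table_directory(data):
    """Hypothetical helper: read TrueType-style table entries from raw bytes."""
    fp = BytesIO(data)
    fp.read(4)  # font type tag (assumed layout, unused here)
    # Assumed header: the first 16-bit field is the number of tables.
    (ntables, _, _, _) = struct.unpack('>HHHH', fp.read(8))
    tables = {}
    for _ in range(ntables):
        # Raises struct.error as soon as fewer than 16 bytes are left.
        (name, tsum, offset, length) = struct.unpack(ENTRY_FORMAT, fp.read(16))
        tables[name] = (offset, length)
    return tables


# A well-formed (made-up) directory with a single entry parses fine.
VALID_FONT = (b'\x00\x01\x00\x00'
              + struct.pack('>HHHH', 1, 0, 0, 0)
              + struct.pack(ENTRY_FORMAT, b'cmap', 0, 28, 4))

# The opening lines of the text stream shown above.
CMAP_TEXT = b"/CIDInit /ProcSet findresource begin\n12 dict begin\nbegincmap\n"

print(read_table_directory(VALID_FONT))

try:
    read_table_directory(CMAP_TEXT)
except struct.error as exc:
    # The text bytes are misread as a huge table count, so the loop
    # keeps reading until the data runs out mid-entry.
    print("struct.error:", exc)
```

The exact wording of the exception differs between Python versions, but the cause is the same: the loop asks for another 16-byte entry and the stream has fewer bytes left.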

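Digging around a bit more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these two cannot be the same.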

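Digging further into the code, I see that all streams are getting the same data. Oops! This is the bug. The cause appears to be that this PDF is missing some end tags, as @hynecker pointed out.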
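If you want to check for this symptom in another file, one option is to compare the decoded bytes of several streams and flag any that are identical. The sketch below assumes you have already collected the stream contents yourself (for example by saving the output of dumppdf -b -i N 2.pdf for each object ID). `find_identical_streams` is a hypothetical helper, and the sample bytes are invented for illustration.

```python
import hashlib
from collections import defaultdict


def find_identical_streams(streams):
    """Return groups of object IDs whose stream bytes are byte-for-byte equal.

    `streams` maps an object ID to the decoded bytes of that stream.
    """
    by_digest = defaultdict(list)
    for objid, data in sorted(streams.items()):
        by_digest[hashlib.sha256(data).hexdigest()].append(objid)
    # Only groups with more than one member are suspicious.
    return [ids for ids in by_digest.values() if len(ids) > 1]


# Invented sample data: objects 17 and 18 carry the same CMap text,
# while the hypothetical object 19 holds something different.
sample_streams = {
    17: b"/CIDInit /ProcSet findresource begin\n",
    18: b"/CIDInit /ProcSet findresource begin\n",
    19: b"\x00\x01\x00\x00 binary font data",
}

print(find_identical_streams(sample_streams))
```

A ToUnicode cmap and an embedded font file sharing the same bytes, as with IDs 17 and 18 here, is a strong hint that the reader handed back the wrong stream.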

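The fix is to return the right data for each stream. Any other fix that merely swallows the error would leave bad data in use for all streams, resulting in, for example, incorrect font definitions.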

I believe the attached patch will fix your problem and should be safe to use in general.