I get the following error when processing a PDF file (2.pdf) with pdfminer (pdf2txt.py):
pdf2txt.py 2.pdf 
Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/pdf2txt.py", line 109, in main
    interpreter.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = self.get_font(None, subspec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
    font = PDFCIDFont(self, spec)
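  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__ …
I am trying to parse PDF files containing Indian voter lists, which are written in Hindi (Devanagari script).
The PDF displays all of the text correctly, but when I convert it to text with PDFMiner, the output contains characters that differ from those in the original PDF.
For example, the displayed (correct) word is सामान्य
but the output word is सपमपनद
Now I want to know why this happens and how to parse this type of PDF file correctly.
I am also including a sample PDF file -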
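A minimal sketch for inspecting the mismatch, using only the two words quoted above. It lists the Unicode code points of the expected and the extracted word so the substitutions become visible. The helper name `codepoint_names` is made up for this illustration. Reading the result as a sign of a missing or incorrect glyph-to-Unicode mapping in the PDF's embedded font is an assumption, not something the excerpt confirms.

```python
import unicodedata

def codepoint_names(word):
    """Return the Unicode character name of every code point in `word`."""
    return [unicodedata.name(ch, "UNKNOWN") for ch in word]

# The two words quoted in the question.
expected = "सामान्य"   # what the PDF viewer shows
extracted = "सपमपनद"   # what the text extraction produced

for label, word in (("expected", expected), ("extracted", extracted)):
    print(label, len(word), "code points")
    for ch, name in zip(word, codepoint_names(word)):
        # Print each code point in U+XXXX form next to its name.
        print("   U+%04X %s" % (ord(ch), name))
```

In this pair the consonants स, म and न survive, while the vowel sign and the conjunct are replaced by unrelated letters, so the damage is limited to particular glyphs rather than the whole string.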
I am trying to install pdfMiner to work with CollectiveAccess. My host (pair.com) has given me the following information to help in this quest:
When compiling, it will likely be necessary to instruct the
installation to use your account space above, and not try to install
into the operating system directories. Typically, using
"--home=/usr/home/username/pdfminer" at the end of the install command should allow for that.
I followed this instruction when trying to install. The result was:
running install
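running …
I am developing a custom search engine for my PDF dataset.
I have a conversion layer that dumps the PDF content to text (using Apache Tika and GROBID). I have finished the search layer and the view that returns the list of search results.
Now I want to add highlighting to the original PDF on the lines where the search term appears. Yes, I am willing to modify the PDF file if necessary.
Is there any way to highlight text in a PDF file? Can PDFMiner, PyPDF2, or some other Python library do this?
...or could I ask some other service, perhaps an external one?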
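One possible approach, sketched under the assumption that PyMuPDF (imported as `fitz`) is acceptable. The question does not mention that library, and the method names `search_for` and `add_highlight_annot` assume a reasonably recent PyMuPDF release. The function names `highlight_terms` and `highlight_file` are hypothetical.

```python
def highlight_terms(doc, terms):
    """Add a highlight annotation for every hit of every term.

    `doc` is any iterable of page objects offering search_for(term) and
    add_highlight_annot(rect), which is the interface assumed of PyMuPDF pages.
    Returns the number of highlights added.
    """
    hits = 0
    for page in doc:
        for term in terms:
            for rect in page.search_for(term):
                page.add_highlight_annot(rect)
                hits += 1
    return hits

def highlight_file(src_path, dst_path, terms):
    """Open src_path, highlight all terms, and save the result to dst_path."""
    import fitz  # PyMuPDF (assumption); imported lazily so the helper above has no dependency
    doc = fitz.open(src_path)
    try:
        hits = highlight_terms(doc, terms)
        doc.save(dst_path)
    finally:
        doc.close()
    return hits
```

Writing to a separate output path leaves the original PDF untouched, which may matter if the conversion layer re-reads the source files.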
When I process a file with pdfminer (pdf2txt.py), I get empty output:
dan@work:~/project$ pdf2txt.py  docs/homericaeast.pdf 
dan@work:~/project$ 
Can anyone tell me what is wrong with this file and what I can do to get data out of it?
Here is the output of dumppdf.py docs/homericaeast.pdf:
<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on
¤µF¤5Á>ó_ýv¬`</string>
<string size="16">on
¤µF¤5Á>ó_ýv¬`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>
<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on
¤µF¤5Á>ó_ýv¬`</string>
<string size="16">on
¤µF¤5Á>ó_ýv¬`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>
I am trying to extract information from some tables in a PDF document.
Consider the input:
Title 1
some text some text some text some text some text
some text some text some text some text some text
Table Title
| Col1          | Col2    | Col3    |
|---------------|---------|---------|
| val11         | val12   | val13   |
| val21         | val22   | val23   |
| val31         | val32   | val33   |
Title 2
some more text some more text some more text some more text
some more text
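some more text some more text …
If I use the code from the answer to: Extracting text from a PDF file using PDFMiner in Python?
I can extract the text when I apply it to this PDF: https://www.tencent.com/en-us/articles/15000691526464720.pdf
However, as you can see under "Consolidated Income Statement", it reads the labels first, i.e. Revenues VAS Online advertising, and only then reads the numbers. I would like it to read row by row, i.e.:
Revenues 73,528 49,552 73,528 66,392 VAS 46,877 35,108 and so on. Is there a way to do this?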
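A minimal sketch of one way to get row-wise output, assuming each text fragment is available together with its position on the page (for example from a layout-aware extractor). Fragments whose vertical positions are close are grouped into one row, then ordered left to right. The function name `group_into_rows`, the `(x, y, text)` tuple format, the tolerance, and the coordinates in the example are all hypothetical; only the labels and numbers come from the question.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows and return one string per row.

    PDF coordinates grow upward, so rows are emitted from the largest y
    (top of the page) to the smallest.
    """
    rows = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))   # same visual line as the previous fragment
        else:
            rows.append([y, [(x, text)]])   # start a new row
    return [" ".join(t for _, t in sorted(cells)) for _, cells in rows]

# Hypothetical positions: labels in a left column, numbers in two columns to the right.
fragments = [
    (50, 700.0, "Revenues"), (50, 685.0, "VAS"),
    (300, 700.5, "73,528"), (400, 699.8, "49,552"),
    (300, 685.2, "46,877"), (400, 684.9, "35,108"),
]
for row in group_into_rows(fragments):
    print(row)
```

The tolerance is a judgment call: too small and a row splits when baselines differ slightly, too large and adjacent rows merge.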
I am looking for other possible solutions with pdfminer.
If I try PyPDF2 with this code, not all of the text appears:
# importing required modules
import PyPDF2
# creating a pdf file object
pdfFileObj = open(file, 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
a=(pdfReader.numPages)
# creating a page object
for i in range(0,a):
    pageObj = pdfReader.getPage(i)
    print(pageObj.extractText())
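I am trying to extract the text of a PDF inside a given bounding rectangle. As far as I know, there are several tools for PDF scraping, such as pdfminer, pypdf, and pdftotext. I have tried all three, and so far I have only gotten code working with pdftotext to extract text from a given bounding box. The code looks like this: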
s = "pdftotext -x %d -y %d -w %d -h %d"
s = s%(<various inputs into my function>)
cmd = [s, pdf_path,
           text_out]
subprocess.call(cmd)
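However, this writes its output to a text file. I want to use the text immediately, meaning I do not want to open a text file to retrieve the words in that bounding box, because I will be doing this for more than 10,000 documents and opening that many files could be a pain. I am basically running a command-line call from a Python script, so I suspect there is no way around this, but I am not sure. Since pdfminer and pypdf are actual Python packages, I can get their text directly, but they do not seem to have any way of extracting text within given pixel limits.
A further note: I want to do this specifically in Python, because I have a lot of other code for the same overall project.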
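A sketch of one way to avoid the intermediate file, assuming the Poppler build of pdftotext. That build writes to standard output when the output file name is `-`, and it takes the crop width and height as `-W` and `-H` (capital letters). The function names are hypothetical. Each argument is passed as its own list element, because `subprocess` does not split a single string element into separate arguments.

```python
import subprocess

def build_pdftotext_cmd(pdf_path, x, y, width, height):
    """Build a pdftotext command that crops to a box and writes to stdout."""
    return [
        "pdftotext",
        "-x", str(x), "-y", str(y),           # top-left corner of the crop area
        "-W", str(width), "-H", str(height),  # crop width and height (Poppler flags)
        pdf_path,
        "-",                                  # "-" as output name: print to stdout
    ]

def extract_box_text(pdf_path, x, y, width, height):
    """Run pdftotext and return the cropped text as a string, without a temp file."""
    cmd = build_pdftotext_cmd(pdf_path, x, y, width, height)
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

A call such as `extract_box_text(pdf_path, 10, 20, 300, 40)` would then return the text directly. The coordinates here are placeholders.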
I need to extract text from pdf-files and have used pdfminer.six with success, extracting both text paragraphs and tables. But now I get an error related to the line
from pdfminer.pdfparser import PDFParser, PDFDocument: 
ImportError: cannot import name 'PDFDocument' from 'pdfminer.pdfparser' (C:\Users[username]\Anaconda3\lib\site-packages\pdfminer\pdfparser.py)
I'm using Anaconda Jupyter. Python 3.7.3. Package pdfminer.six-20181108
The code I'm using is based on this: How to read pdf file using pdfminer3k?
Based on advice given below I've tried to uninstall and reinstall Anaconda and pdfminer.six and …
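A hedged sketch of a possible workaround. In pdfminer.six the `PDFDocument` class is defined in `pdfminer.pdfdocument`, not in `pdfminer.pdfparser`, so code written for pdfminer3k fails on that import line. The helper `import_first` is hypothetical: it returns a name from the first module in a list that provides it, which lets the same script run against either package layout.

```python
import importlib

def import_first(attr, module_names):
    """Return `attr` from the first module in `module_names` that defines it."""
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # module not installed in this environment; try the next one
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError("cannot find %r in any of %r" % (attr, module_names))
```

With pdfminer.six installed, `PDFDocument = import_first("PDFDocument", ["pdfminer.pdfdocument", "pdfminer.pdfparser"])` should resolve to the class. If only pdfminer.six is targeted, the plain form `from pdfminer.pdfdocument import PDFDocument` is simpler.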
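I have written Python code that scrapes all the data from a PDF file. The problem is that, once scraped, the words lose their grammar. How can I fix this? My code is attached.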
# Python 2 code (cStringIO, print statement); indentation normalized so it parses.
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()  # empty set: process every page
        for page in PDFPage.get_pages(fp, pagenos, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
        text = retstr.getvalue()
    device.close()
    retstr.close()
    return text
print convert_pdf_to_txt("S24A276P001.pdf") …