Error: cannot import name 'PDFDocument' from 'pdfminer.pdfparser'

Question


I need to extract text from PDF files and have used pdfminer.six successfully, extracting both text paragraphs and tables. But now I get an error related to this line:

from pdfminer.pdfparser import PDFParser, PDFDocument


ImportError: cannot import name 'PDFDocument' from 'pdfminer.pdfparser' (C:\Users[username]\Anaconda3\lib\site-packages\pdfminer\pdfparser.py)

I'm using Jupyter from Anaconda, with Python 3.7.3 and the package pdfminer.six-20181108.

The code I'm using is based on this: How to read pdf file using pdfminer3k?

Based on the advice in the issue linked here, I have uninstalled and reinstalled Anaconda, pdfminer.six, and other packages several times: https://github.com/pdfminer/pdfminer.six/issues/196 A week ago it suddenly worked, but now the error is back.

Since I'm working on Windows 10, I also tried Ubuntu Linux as described here: https://medium.com/hugo-ferreiras-blog/using-windows-subsystem-for-linux-for-data-science-9a8e68d7610c

Same error.

Then, based on the webpage below, I thought it was worth trying to split the import of PDFParser and PDFDocument, changing it from

from pdfminer.pdfparser import PDFParser, PDFDocument


to

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage


https://loctv.wordpress.com/2017/02/07/fix-importerror-cannot-import-name-pdfdocument-when-using-slate/

But that created new errors later in the code.

The start of my code looks like this:

```
path = [name and path of file]
fp = open(path, 'rb')
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
```


I expect to be able to run the code and extract the text from the PDF file, but the code stops at the ImportError for PDFDocument from pdfminer.pdfparser.

Any advice on what I should do is much appreciated! Might it have something to do with how pdfminer.six is installed?
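
One way to look into the installation question is to list which pdfminer-related distributions are present and where the `pdfminer` package would be loaded from. The sketch below is only a diagnostic under some assumptions: the helper names `installed_pdfminer_distributions` and `pdfminer_location` are made up for illustration, and `importlib.metadata` requires Python 3.8 or later (on 3.7 the `importlib_metadata` backport offers the same calls). To my knowledge, `pdfminer`, `pdfminer.six`, and `pdfminer3k` all install a top-level package named `pdfminer`, so having more than one of them in the same environment could explain an import that works one week and fails the next.

```python
from importlib import metadata  # Python 3.8+; assumption, see the note above
from importlib.util import find_spec

# Distributions that are believed to ship a top-level "pdfminer" package.
CANDIDATES = ("pdfminer", "pdfminer.six", "pdfminer3k")


def installed_pdfminer_distributions(candidates=CANDIDATES):
    """Return {distribution name: version} for each candidate that is installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the current environment.
            pass
    return found


def pdfminer_location():
    """Return the file the 'pdfminer' package would be imported from, or None."""
    spec = find_spec("pdfminer")
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    print(installed_pdfminer_distributions())
    print(pdfminer_location())
```

If more than one entry shows up in the dictionary, uninstalling all of them and reinstalling only pdfminer.six is probably worth trying before changing any code.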

Answer 1

Ing*_*org 6

I got help from Notodden Serit. Change this:

from pdfminer.pdfparser import PDFParser, PDFDocument


to:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage


Then add the parser, changing:

doc = PDFDocument()


to:

doc = PDFDocument(parser)


And then change:

for page in doc.get_pages():


to:

for page in PDFPage.create_pages(doc):

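
Putting the three changes together, the extraction could look roughly like the sketch below. It assumes pdfminer.six is the installed package and reuses the imports from the question. The function name `extract_text_boxes` and the choice to collect the text into a list are illustrative assumptions, not part of the original code.

```python
def extract_text_boxes(path):
    """Return the text of every text box or text line found in the PDF at `path`."""
    # Imports are kept inside the function so that defining it never fails;
    # an ImportError like the one in the question would surface on the first call.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    texts = []
    with open(path, "rb") as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)  # the parser is passed to the constructor

        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # PDFPage.create_pages(doc) replaces the older doc.get_pages()
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            layout = device.get_result()
            for obj in layout:
                if isinstance(obj, (LTTextBox, LTTextLine)):
                    texts.append(obj.get_text())
    return texts
```

A call such as `extract_text_boxes("example.pdf")`, where the file name is a placeholder, would then return one string per text box on each page.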
