Python Data Extraction from an Encrypted PDF

Question

Beg*_*ner 11 python pdf encryption extraction

I am a recent graduate in pure mathematics who has taken only a few basic programming courses. I am doing an internship and have an internal data analysis project: I have to analyze the internal PDFs from the last few years. The PDFs are "secured"; in other words, they are encrypted. We do not have the PDF passwords, and we are not even sure that passwords exist. However, we have all of these documents and can read them manually. We can print them as well. The goal is to read them with Python, because it is the language we are somewhat familiar with.

First, I tried to read the PDFs with some Python libraries. However, the Python libraries that I found do not read encrypted PDFs. At that time, I could not export the information using Adobe Reader either.

Second, I decided to decrypt the PDFs. I was successful using the Python library pikepdf. Pikepdf works very well! However, the decrypted PDFs cannot be read either with the Python libraries from the previous point (PyPDF2 and Tabula). We have now made some progress, because using Adobe Reader I can export the information from the decrypted PDFs, but the goal is to do everything with Python.

The code that I am showing works perfectly with unencrypted PDFs, but not with encrypted PDFs. It does not work with the decrypted PDFs produced by pikepdf either.

I did not write the code. I found it in the documentation of the Python libraries pikepdf and Tabula. The PyPDF2 solution was written by Al Sweigart in his book "Automate the Boring Stuff with Python", which I highly recommend. I also checked that the code works, with the limitations that I explained above.

First question: why can I not read the decrypted files, if the programs work with files that have never been encrypted?

Second question: can we read the decrypted files with Python somehow? Which library can do it, or is it impossible? Are all decrypted PDFs extractable?

Thank you for your time and help!!!

I obtained these results using Python 3.7, Windows 10, Jupyter Notebooks, and Anaconda 2019.07.

Python

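import pikepdf

# Open the secured PDF and save a decrypted copy.
# Note: this snippet also deletes the last page before saving.
with pikepdf.open("encrypted.pdf") as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("decrypted.pdf")

import tabula

# Try to extract tables from the decrypted copy.
tabula.read_pdf("decrypted.pdf", stream=True)

import PyPDF2

# Try to extract the text of the first page of the decrypted copy.
pdfFileObj = open("decrypted.pdf", "rb")
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages
pageObj = pdfReader.getPage(0)
pageObj.extractText()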

With Tabula, I am getting the message "the output file is empty."

With PyPDF2, I am getting only '\n'.

UPDATE 10/3/2019 Pdfminer.six (Version November 2018)

I got better results using the solution posted by DuckPuncher. For the decrypted file, I got the labels, but not the data. The same happens with the encrypted file. For the file that has never been encrypted, it works perfectly. As I need both the data and the labels from encrypted or decrypted files, this code does not work for me. For that analysis, I used pdfminer.six, a Python library that was released in November 2018. Pdfminer.six includes the library pycryptodome. According to its documentation, "PyCryptodome is a self-contained Python package of low-level cryptographic primitives."

The code is in the Stack Exchange question: Extracting text from a PDF file using PDFMiner in python?

I would love it if you repeated my experiment. Here is the description:

1) Run the code mentioned in this question with any PDF that has never been encrypted.

2) Do the same with a "Secure" PDF (this is a term that Adobe uses); I am calling it the encrypted PDF. Use a generic form that you can find using Google. After you download it, you need to fill in the fields. Otherwise, you would be checking for labels, but not fields. The data is in the fields.

3) Decrypt the encrypted PDF using pikepdf. This will be the decrypted PDF.

4) Run the code again using the decrypted PDF.

UPDATE 10/4/2019 Camelot (Version July 2019)

I found the Python library Camelot. Be careful: you need camelot-py 0.7.3.

It is very powerful and works with Python 3.7. It is also very easy to use. First, you also need to install Ghostscript; otherwise, it will not work. You also need to install Pandas. Do not use pip install camelot-py. Instead, use pip install camelot-py[cv]

The author of the program is Vinayak Mehta. Frank Du shares this code in a YouTube video, "Extract tabular data from PDF with Camelot Using Python."

I checked the code and it works with unencrypted files. However, it does not work with encrypted or decrypted files, and those are my goal.

Camelot is oriented toward extracting tables from PDFs.

Here is the code:

Python

import camelot
import pandas

name_table = camelot.read_pdf("unencrypted.pdf")
type(name_table)
# This is a camelot TableList (a list-like collection of tables)

# The first table found in the PDF (a camelot Table object)
first_table = name_table[0]

# Translate the camelot table object to a pandas dataframe
first_table.df

# This creates an Excel file.
# The same can be done with csv, json, html, or sqlite.
first_table.to_excel("unencrypted.xlsx")

# To get all the tables of the PDF you need to use this code.
for table in name_table:
    print(table.df)

UPDATE 10/7/2019 I found one trick. If I open the secured PDF with Adobe Reader, print it using Microsoft Print to PDF, and save it as a PDF, I can extract the data from that copy. I can also convert the PDF file to JSON, Excel, SQLite, CSV, HTML, and other formats. This is a possible solution to my question. However, I am still looking for a way to do it without that trick, because the goal is to do it 100% with Python. I am also concerned that the trick might not work if a better method of encryption is used. Sometimes you need to use Adobe Reader several times to get an extractable copy.

UPDATE 10/8/2019. Third question. I now have a third question. Are all secured/encrypted PDFs password protected? Why is pikepdf not working? My guess is that the current version of pikepdf can break some types of encryption but not all of them. @constt mentioned that PyPDF2 can break some types of protection. However, I replied that I found an article saying that PyPDF2 can break encryption made with Adobe Acrobat Pro 6.0, but not with later versions.

Answer 1

Lif*_*lex 7

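Last updated 10-11-2019

I am not sure that I fully understand your question. The code below can be refined, but it reads an encrypted or unencrypted PDF and extracts the text. Please let me know if I have misunderstood your requirements.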

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def extract_encrypted_pdf_text(path, encryption_true, decryption_password):
    output = StringIO()

    resource_manager = PDFResourceManager()
    laparams = LAParams()

    device = TextConverter(resource_manager, output, codec='utf-8', laparams=laparams)

    pdf_infile = open(path, 'rb')
    interpreter = PDFPageInterpreter(resource_manager, device)

    # An empty set of page numbers means that every page is processed.
    page_numbers = set()

    if not encryption_true:
        for page in PDFPage.get_pages(pdf_infile, page_numbers, maxpages=0, caching=True, check_extractable=True):
            interpreter.process_page(page)

    else:
        for page in PDFPage.get_pages(pdf_infile, page_numbers, maxpages=0, password=decryption_password, caching=True, check_extractable=True):
            interpreter.process_page(page)

    text = output.getvalue()
    pdf_infile.close()
    device.close()
    output.close()
    return text


results = extract_encrypted_pdf_text('encrypted.pdf', True, 'password')
print(results)

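I noticed that the pikepdf code you used to open the encrypted PDF is missing a password, which should raise the following error message:

pikepdf._qpdf.PasswordError: encrypted.pdf: invalid password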

import pikepdf

# Supply the document open password when opening the encrypted file.
with pikepdf.open("encrypted.pdf", password='password') as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("decrypted.pdf")

You can use tika to extract the text from the decrypted.pdf created by pikepdf.

from tika import parser

parsedPDF = parser.from_file("decrypted.pdf")
pdf = parsedPDF["content"]
pdf = pdf.replace('\n\n', '\n')

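Also, pikepdf does not currently implement text extraction; this includes the latest release, v1.6.4.

I decided to run several tests using various encrypted PDF files.

I named all of the encrypted files "encrypted.pdf", and they all used the same encryption and decryption password.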

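Adobe Acrobat 9.0 and later - encryption level 256-bit AES
- pikepdf was able to decrypt this file
- PyPDF2 could not extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 6.0 and later - encryption level 128-bit RC4
- pikepdf was able to decrypt this file
- PyPDF2 could not extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 3.0 and later - encryption level 40-bit RC4
- pikepdf was able to decrypt this file
- PyPDF2 could not extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 5.0 and later - encryption level 128-bit RC4
- created with Microsoft Word
- pikepdf was able to decrypt this file
- PyPDF2 could extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 9.0 and later - encryption level 256-bit AES
- created using pdfprotectfree
- pikepdf was able to decrypt this file
- PyPDF2 could extract the text correctly
- tika could extract the text correctly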

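PyPDF2 was able to extract text from decrypted PDF files that were not created with Adobe Acrobat.

I assume that the failures have something to do with embedded formatting in the PDFs created by Adobe Acrobat. More testing is needed to confirm this conjecture about the formatting.

tika was able to extract text from all of the documents decrypted with pikepdf.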

import pikepdf

with pikepdf.open("encrypted.pdf", password='password') as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("decrypted.pdf")


from PyPDF2 import PdfFileReader


def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # Page indices are zero-based, so this is the second page.
        page = pdf.getPage(1)
        print('Page type: {}'.format(str(type(page))))
        text = page.extractText()
        print(text)


text_extractor('decrypted.pdf')
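
One possibility worth checking, given the question's remark that the data is in the form fields: values typed into a fillable form are stored in the document's form field tree rather than in the page text, which could explain an extractor returning labels but no data. The sketch below is not part of the tests above. The helper names `collect_fields` and `read_form_fields` are hypothetical, and it assumes an AcroForm-style form (XFA forms store their data differently and are not handled).

```python
def collect_fields(fields, parent=""):
    """Walk a form field tree and return {full_field_name: value}.

    Each node only needs a dict-like .get(), so this works both with
    pikepdf dictionaries and with plain dicts.
    """
    values = {}
    for field in fields:
        partial = field.get("/T")          # partial field name
        if partial is None:
            name = parent
        elif parent:
            name = parent + "." + str(partial)
        else:
            name = str(partial)
        value = field.get("/V")            # field value, if any
        if value is not None and name:
            values[name] = str(value)
        kids = field.get("/Kids")          # child fields or widgets
        if kids is not None:
            values.update(collect_fields(kids, name))
    return values


def read_form_fields(path, password=""):
    """Open a PDF with pikepdf and return its AcroForm field values."""
    import pikepdf  # third-party; imported lazily

    with pikepdf.open(path, password=password) as pdf:
        acroform = pdf.Root.get("/AcroForm")
        if acroform is None:
            return {}
        return collect_fields(acroform.get("/Fields", []))
```

Calling `read_form_fields("decrypted.pdf")` would return a dictionary of field names and values if the file contains an AcroForm, and an empty dictionary otherwise.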

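PyPDF2 cannot decrypt Acrobat PDF files >= 6.0

This issue has been open with the module owners since September 15, 2015. It is unclear from the comments on the issue when the project owners will fix it. The last commit was on June 25, 2018.

PyPDF4 decryption issues

PyPDF4 is the replacement for PyPDF2. This module also has decryption problems with certain algorithms used to encrypt PDF files.

Test file: Adobe Acrobat 9.0 and later - encryption level 256-bit AES

PyPDF2 error message: only algorithm code 1 and 2 are supported

PyPDF4 error message: only algorithm code 1 and 2 are supported. This PDF uses code 5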
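
The "algorithm code" in these messages corresponds to the /V entry of the PDF's encryption dictionary. As a rough guide, the sketch below maps the documented codes to the ciphers listed in the tests above. The mapping summarizes the PDF specification rather than the PyPDF2 source, and the function names `describe_algorithm` and `encryption_summary` are hypothetical.

```python
# /V values of the encryption dictionary, as summarized from the PDF spec.
ALGORITHMS = {
    1: "RC4, 40-bit key",
    2: "RC4, key length 40 to 128 bits",
    4: "crypt filters (RC4 or AES-128)",
    5: "AES-256",
}


def describe_algorithm(code):
    """Return a short description of an encryption algorithm code."""
    return ALGORITHMS.get(code, "unknown or unpublished algorithm")


def encryption_summary(path, password=""):
    """Report the algorithm code of a PDF that pikepdf can open."""
    import pikepdf  # third-party; imported lazily

    with pikepdf.open(path, password=password) as pdf:
        if not pdf.is_encrypted:
            return "not encrypted"
        info = pdf.encryption
        return f"V={info.V}: {describe_algorithm(info.V)}, {info.bits}-bit key"
```

Under this mapping, code 5 is the 256-bit AES case from the test file, which is outside the codes 1 and 2 that the error messages say are supported.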

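Update section 10-11-2019

This section is a response to your updates from 10-07-2019 and 10-08-2019.

In your update you stated that you could open the "secured PDF with Adobe Reader" and print the document to another PDF, which removes the "SECURED" flag. After doing some testing, I believe that I have figured out what is happening in this scenario.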

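Adobe PDF security levels

Adobe PDFs have multiple types of security controls that can be enabled by the owner of the document. The controls can be enforced with either a password or a certificate.

Document encryption (enforced with a document open password)
- Encrypt all document contents (most common)
- Encrypt all document contents except metadata => Acrobat 6.0
- Encrypt only file attachments => Acrobat 7.0
Restrictive editing and printing (enforced with a permissions password)
- Printing allowed
- Changes allowed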
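
For the standard password-based security handler, the printing and editing restrictions are recorded as bit flags in the /P entry of the encryption dictionary. The following is a minimal sketch of decoding such a value, assuming the bit positions given in the PDF specification's table of user access permissions (bits 9 to 12 only apply to revision 3 and later). The names `PERMISSION_BITS` and `decode_permissions` are hypothetical.

```python
# 1-based bit positions of the /P permission flags (per the PDF spec).
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "extract": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p):
    """Decode the signed 32-bit /P value into {permission: allowed}."""
    return {
        name: bool(p & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


# Example: a /P value that allows printing and extraction only.
flags = decode_permissions(-44)
print(flags["print"], flags["modify"], flags["extract"])
```

A set bit means the operation is allowed; whether a given reader honors the flags is up to that reader.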

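The image below shows an Adobe PDF encrypted with 256-bit AES encryption. A password is required to open or print this PDF. When you open this document in Adobe Reader with the password, the title shows SECURED.

This document requires a password to be opened with the Python modules mentioned in this answer. If you try to open an encrypted PDF with Adobe Reader, you should see the following:

If you do not get this warning, then the document either has no security controls enabled or only has the restrictive editing and printing controls enabled.

The image below shows restrictive editing enabled with a password in a PDF document. Note that printing is enabled. No password is required to open or print this PDF. When you open this document in Adobe Reader without a password, the title shows SECURED. This is the same warning as for the encrypted PDF that was opened with a password.

When you print this document to a new PDF, the SECURED warning is removed, because the restrictive editing has been removed.

All Adobe products enforce the restrictions set by the permissions password. However, if third-party products do not support these settings, document recipients are able to bypass some or all of the restrictions.

So I assume that the document you are printing to PDF has restrictive editing enabled and does not require a password to open.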
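
The three situations described above (no security, permissions only, document open password) can be told apart programmatically. The following is a minimal sketch using pikepdf, assuming that a permissions-only file opens without a password but still reports itself as encrypted. The function names `classify` and `triage` are hypothetical.

```python
def classify(opens_without_password, is_encrypted):
    """Map the two observations to one of the three cases."""
    if not opens_without_password:
        return "open password required"
    if is_encrypted:
        return "permissions only"
    return "not encrypted"


def triage(path):
    """Try to open a PDF without a password and classify the result."""
    import pikepdf  # third-party; imported lazily

    try:
        with pikepdf.open(path) as pdf:
            return classify(True, pdf.is_encrypted)
    except pikepdf.PasswordError:
        return classify(False, True)
```

If `triage` reports "permissions only" for the documents in the question, that would be consistent with the print-to-PDF trick working without any password.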

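Regarding breaking PDF encryption

Neither PyPDF2 nor PyPDF4 is designed to break the document open password function of a PDF document. Both modules will throw the following error if they attempt to open an encrypted, password-protected PDF file.

PyPDF2.utils.PdfReadError: file has not been decrypted

The open password function of an encrypted PDF file can be bypassed using a variety of methods, but a single technique might not work, and some will not be acceptable, because of several factors, including password complexity.

PDF encryption internally works with encryption keys of 40, 128, or 256 bits, depending on the PDF version. The binary encryption key is derived from a password provided by the user. The password is subject to length and encoding constraints.

For example, PDF 1.7 Adobe Extension Level 3 (Acrobat 9 - AES-256) introduced Unicode characters (65,536 possible characters) and increased the maximum length to 127 bytes in the UTF-8 representation of the password.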
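
To make the length constraint and the key sizes concrete, here is a small standard-library sketch. It only illustrates the UTF-8 encoding and the 127-byte limit mentioned above; the specification also normalizes the password before encoding, which is omitted here, and the helper names `encode_password` and `key_space` are hypothetical.

```python
def encode_password(password):
    """UTF-8 encode a password and keep at most 127 bytes.

    Simplified: the normalization step that the specification applies
    before encoding is not performed. Truncation is by bytes, so a
    multi-byte character at the boundary may be cut.
    """
    return password.encode("utf-8")[:127]


def key_space(bits):
    """Number of distinct encryption keys for a given key length."""
    return 2 ** bits


print(len(encode_password("a" * 200)))
for bits in (40, 128, 256):
    print(bits, key_space(bits))
```

The jump from 40 to 128 or 256 bits is why guessing the key directly is only conceivable for the oldest files, and why attacks on newer files target the password instead.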

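The code below opens a PDF that has restrictive editing enabled. It saves this file to a new PDF without the SECURED warning being added. The tika code then parses the contents of the new file.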

from tika import parser
import pikepdf

# Opens a PDF with restrictive editing enabled, but that still
# allows printing, and saves a copy of it.
with pikepdf.open("restrictive_editing_enabled.pdf") as pdf:
    pdf.save("restrictive_editing_removed.pdf")

# plain text output
parsedPDF = parser.from_file("restrictive_editing_removed.pdf")

# XHTML output
# parsedPDF = parser.from_file("restrictive_editing_removed.pdf", xmlContent=True)

content = parsedPDF["content"]
content = content.replace('\n\n', '\n')
print(content)

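This code checks whether a password is required to open the file. The code can be refined and other functions can be added. There are several other features that could be added, but the documentation for pikepdf does not match the comments in the code base, so more research is required to improve this.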

# this would be removed once logging is used
############################################
import sys
sys.tracebacklimit = 0
############################################

import pikepdf
from tika import parser


def create_pdf_copy(pdf_file_name):
    # Re-save the file under a new name using pikepdf.
    with pikepdf.open(pdf_file_name) as pdf:
        new_filename = f'copy_{pdf_file_name}'
        pdf.save(new_filename)
        return new_filename


def extract_pdf_content(pdf_file_name):
    # plain text output
    # parsedPDF = parser.from_file(pdf_file_name)

    # XHTML output
    parsedPDF = parser.from_file(pdf_file_name, xmlContent=True)

    content = parsedPDF["content"]
    content = content.replace('\n\n', '\n')
    return content


def password_required(pdf_file_name):
    # Returns an error string, or None if the file opens without a password.
    try:
        with pikepdf.open(pdf_file_name):
            return None
    except pikepdf.PasswordError:
        return 'password required'
    except pikepdf.PdfError:
        return 'cannot open file'


filename = 'decrypted.pdf'
problem = password_required(filename)
if problem is not None:
    print(problem)
else:
    pdf_file = create_pdf_copy(filename)
    results = extract_pdf_content(pdf_file)
    print(results)

How do you open a secured PDF file without providing a password? (2 upvotes)
