Ned*_*der 31
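Often in a PDF, the image is simply stored as-is. For example, a PDF with a jpg inserted will have a range of bytes somewhere in the middle that, when extracted, is a valid jpg file. You can use this to extract byte ranges from the PDF very simply. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs.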
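To make the "stored as-is" point concrete, here is a minimal sketch that checks whether an extracted byte range starts with a well-known image signature. The helper names `sniff_image_type` and `looks_like_complete_jpeg` are hypothetical and not part of any library; the signature table covers only a handful of formats.

```python
# Hypothetical helpers: guess an image type from its leading "magic" bytes.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpg"),                    # JPEG: SOI marker followed by another marker
    (b"\x89PNG\r\n\x1a\n", "png"),               # PNG signature
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jp2"),  # JPEG 2000 file (JP2 signature box)
    (b"II*\x00", "tiff"),                        # TIFF, little-endian
    (b"MM\x00*", "tiff"),                        # TIFF, big-endian
]


def sniff_image_type(data):
    """Return a file extension for a known signature, or None."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    return None


def looks_like_complete_jpeg(data):
    """True if data starts with SOI (FF D8) and ends with EOI (FF D9)."""
    # A trailing newline before "endstream" is tolerated.
    return data.startswith(b"\xff\xd8") and data.rstrip(b"\r\n").endswith(b"\xff\xd9")
```

A check like this is only a plausibility test: it says nothing about whether the bytes in between decode correctly.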
syl*_*ain 31
In Python, with the PyPDF2 and Pillow libraries, it is simple:
import PyPDF2
from PIL import Image

# Uses the PyPDF2 1.x/2.x names (PdfFileReader, getPage, getObject, getData).
if __name__ == '__main__':
    input1 = PyPDF2.PdfFileReader(open("input.pdf", "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if xObject[obj]['/Filter'] == '/FlateDecode':
                # getData() inflates the stream into raw pixel data
                data = xObject[obj].getData()
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                # JPEG bytes are stored as-is; getData() can raise
                # NotImplementedError for this filter, so write the raw stream
                with open(obj[1:] + ".jpg", "wb") as img:
                    img.write(xObject[obj]._data)
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                # JPEG 2000 bytes are likewise written unchanged
                with open(obj[1:] + ".jp2", "wb") as img:
                    img.write(xObject[obj]._data)
小智 23
In Python with PyPDF2, for the CCITTFaxDecode filter:
import PyPDF2
import struct

"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""


def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    # 8-byte file header, one IFD with 8 entries, then the 4-byte offset
    # of the next IFD (the original used a 2-byte field here)
    tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'l'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: little-endian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       8,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, length
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                       262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # last IFD
                       )


pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
for i in range(0, cond_scan_reader.getNumPages()):
    page = cond_scan_reader.getPage(i)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            """
            The CCITTFaxDecode filter decodes image data that has been encoded using
            either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
            designed to achieve efficient compression of monochrome (1 bit per pixel) image
            data at relatively low resolutions, and so is useful only for bitmap image data, not
            for color images, grayscale images, or general data.

            K < 0 --- Pure two-dimensional encoding (Group 4)
            K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
            K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
            """
            if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                # Any negative K means Group 4, per the rule quoted above
                if xObject[obj]['/DecodeParms']['/K'] < 0:
                    CCITT_group = 4
                else:
                    CCITT_group = 3
                width = xObject[obj]['/Width']
                height = xObject[obj]['/Height']
                data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                img_name = obj[1:] + '.tiff'
                with open(img_name, 'wb') as img_file:
                    img_file.write(tiff_header + data)
                #
                # import io
                # from PIL import Image
                # im = Image.open(io.BytesIO(tiff_header + data))
pdf_file.close()
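One way to sanity-check a hand-built TIFF header like the one above is to parse it back and compare the tag values with what was intended. The sketch below is an assumption-laden helper, not part of PyPDF2: `read_first_ifd` is a hypothetical name, and it only understands little-endian files with SHORT or LONG entries of count 1, which is all such a header contains.

```python
import struct


def read_first_ifd(tiff_bytes):
    """Return {tag: value} for the first IFD of a little-endian TIFF.

    Only SHORT (3) and LONG (4) entries with count 1 are decoded;
    anything else is skipped.
    """
    byte_order, magic, ifd_offset = struct.unpack_from('<2sHL', tiff_bytes, 0)
    if byte_order != b'II' or magic != 42:
        raise ValueError('not a little-endian TIFF')
    (n_entries,) = struct.unpack_from('<H', tiff_bytes, ifd_offset)
    tags = {}
    for i in range(n_entries):
        # Each IFD entry is 12 bytes: tag, type, count, value/offset
        entry_offset = ifd_offset + 2 + 12 * i
        tag, field_type, count, value = struct.unpack_from('<HHLL', tiff_bytes, entry_offset)
        if field_type in (3, 4) and count == 1:
            tags[tag] = value
    return tags


# Tiny two-entry header built by hand, just to exercise the parser
header = struct.pack('<2sHLH' + 'HHLL' * 2 + 'L',
                     b'II', 42, 8, 2,
                     256, 4, 1, 1728,  # ImageWidth
                     259, 3, 1, 4,     # Compression = CCITT Group 4
                     0)                # no further IFD
print(read_first_ifd(header))
```

Feeding it the bytes returned by `tiff_header_for_CCITT` should give back the width, height, compression group and strip size that were passed in.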
kat*_*yna 20
You can use the PyMuPDF module. It outputs all images as .png files, but it works out of the box and is fast.
import fitz

# camelCase names as in older PyMuPDF releases; newer releases use
# snake_case equivalents (get_page_images, Pixmap.save)
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:  # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
dka*_*dal 14
Libpoppler comes with a tool called "pdfimages".
(On Ubuntu systems it is in the poppler-utils package.)
http://poppler.freedesktop.org/
http://en.wikipedia.org/wiki/Pdfimages
Windows binaries: http://blog.alivate.com.au/poppler-windows/
and*_*otn 10
PikePDF can do this with very little code:
from pikepdf import Pdf, PdfImage

filename = "sample-in.pdf"
example = Pdf.open(filename)

for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")

extract_to will automatically pick the file extension based on how the image is encoded in the PDF.
If you want, you can also print some details about the images as they are extracted:

        # Optional: print info about image
        w = raw_image.stream_dict.Width
        h = raw_image.stream_dict.Height
        f = raw_image.stream_dict.Filter
        size = raw_image.stream_dict.Length

        print(f"Wrote {name} {w}x{h} {f} {size:,}B {image.colorspace} to {out}")

which can print something like

Wrote /Im1 150x150 /DCTDecode 5,952B /ICCBased to sample2.pdf-page000-img000.jpg
Wrote /Im10 32x32 /FlateDecode 36B /ICCBased to sample2.pdf-page000-img001.png
...

See the docs for more that you can do with images, including replacing them in the PDF file.
While this usually works pretty well, note that there are a number of images that won't be extracted this way.
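I started from @sylvain's code, which has some flaws, such as the exception NotImplementedError: unsupported filter /DCTDecode raised by getData, or the fact that the code fails to find images on some pages because they sit at a deeper level than the page.
Here is my code: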
import PyPDF2
from PIL import Image
import sys
from os import path
import warnings

warnings.filterwarnings("ignore")

number = 0  # running count of extracted images


def recurse(page, xObject):
    global number
    xObject = xObject['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj]._data  # raw, still-encoded stream bytes
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"
            imagename = "%s - p. %s - %s" % (abspath[:-4], page, obj[1:])
            if xObject[obj]['/Filter'] == '/FlateDecode':
                # Image.frombytes needs inflated pixel data, not the raw stream
                img = Image.frombytes(mode, size, xObject[obj].getData())
                img.save(imagename + ".png")
                number += 1
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                with open(imagename + ".jpg", "wb") as img:
                    img.write(data)
                number += 1
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                with open(imagename + ".jp2", "wb") as img:
                    img.write(data)
                number += 1
        else:
            # Not an image (e.g. a form XObject): look for images nested inside it
            recurse(page, xObject[obj])


try:
    _, filename, *pages = sys.argv
    *pages, = map(int, pages)
    abspath = path.abspath(filename)
except ValueError:
    print('Usage :\nPDF_extract_images file.pdf page1 page2 page3 …')
    sys.exit()

file = PyPDF2.PdfFileReader(open(filename, "rb"))
for p in pages:
    page0 = file.getPage(p - 1)
    recurse(p, page0)
print('%s extracted images' % number)
I prefer minecart because it is very easy to use. The snippet below shows how to extract images from a PDF:
# pip install minecart
import minecart

pdffile = open('Invoices.pdf', 'rb')
doc = minecart.Document(pdffile)

page = doc.get_page(0)  # getting a single page

# iterating through all pages
for page in doc.iter_pages():
    # requires pillow; assumes every page has at least one image
    im = page.images[0].as_pil()
    display(im)  # display() is available in Jupyter/IPython
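Here is my 2019 version, which recursively gets all images from a PDF and reads them with PIL. It is compatible with Python 2/3. I also found that images in a PDF are sometimes compressed with zlib, so my code supports decompression.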
#!/usr/bin/env python3
try:
    from StringIO import StringIO
except ImportError:
    from io import BytesIO as StringIO
from PIL import Image
from PyPDF2 import PdfFileReader, generic
import zlib


def get_color_mode(obj):
    try:
        cspace = obj['/ColorSpace']
    except KeyError:
        return None

    if cspace == '/DeviceRGB':
        return "RGB"
    elif cspace == '/DeviceCMYK':
        return "CMYK"
    elif cspace == '/DeviceGray':
        return "P"

    # ICC-based colour spaces: /N gives the number of colour components
    if isinstance(cspace, generic.ArrayObject) and cspace[0] == '/ICCBased':
        color_map = obj['/ColorSpace'][1].getObject()['/N']
        if color_map == 1:
            return "P"
        elif color_map == 3:
            return "RGB"
        elif color_map == 4:
            return "CMYK"


def get_object_images(x_obj):
    images = []
    for obj_name in x_obj:
        sub_obj = x_obj[obj_name]

        # Recurse into nested XObjects before checking for images
        if '/Resources' in sub_obj and '/XObject' in sub_obj['/Resources']:
            images += get_object_images(sub_obj['/Resources']['/XObject'].getObject())

        elif sub_obj['/Subtype'] == '/Image':
            zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
            if zlib_compressed:
                sub_obj._data = zlib.decompress(sub_obj._data)

            images.append((
                get_color_mode(sub_obj),
                (sub_obj['/Width'], sub_obj['/Height']),
                sub_obj._data
            ))

    return images


def get_pdf_images(pdf_fp):
    images = []
    try:
        pdf_in = PdfFileReader(open(pdf_fp, "rb"))
    except Exception:
        return images

    for p_n in range(pdf_in.numPages):
        page = pdf_in.getPage(p_n)

        try:
            page_x_obj = page['/Resources']['/XObject'].getObject()
        except KeyError:
            continue  # page has no XObjects

        images += get_object_images(page_x_obj)

    return images


if __name__ == "__main__":
    pdf_fp = "test.pdf"

    for image in get_pdf_images(pdf_fp):
        (mode, size, data) = image
        try:
            img = Image.open(StringIO(data))
        except Exception as e:
            print("Failed to read image with PIL: {}".format(e))
            continue
        # Do whatever you want with the image
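After inflating a zlib-compressed stream it can help to check that the number of bytes matches the image geometry before handing the data to an image library. The following is a minimal stdlib-only sketch; `expected_raw_size` and `inflate_and_check` are hypothetical helper names, and the hint about `/Predictor` in the error message is only one possible cause of a mismatch.

```python
import zlib


def expected_raw_size(width, height, components, bits_per_component=8):
    """Bytes in an uncompressed PDF image; each row is padded to a whole byte."""
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8
    return bytes_per_row * height


def inflate_and_check(stream_bytes, width, height, components, bits_per_component=8):
    """Inflate a /FlateDecode stream and verify its size against the geometry."""
    raw = zlib.decompress(stream_bytes)
    expected = expected_raw_size(width, height, components, bits_per_component)
    if len(raw) != expected:
        raise ValueError(
            'got %d bytes, expected %d (a /Predictor may be in use)'
            % (len(raw), expected))
    return raw


# A 2x2 RGB image needs 2 * 2 * 3 = 12 bytes of pixel data
pixels = bytes(range(12))
assert inflate_and_check(zlib.compress(pixels), 2, 2, 3) == pixels
```

The component count would come from the colour space (1 for gray, 3 for RGB, 4 for CMYK) and the bit depth from `/BitsPerComponent`.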
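OK, so I struggled with this for many weeks. Many of these answers helped me along, but something was always missing; apparently no one here has ever had trouble with JBIG2-encoded images.
In the bunch of PDFs that I have to scan, JBIG2-encoded images are very common.
As far as I understand, many copy/scan machines scan paper and turn it into PDF files full of JBIG2-encoded images.
So, after many days of testing, I decided to go with the answer proposed here by dkgedal long ago.
Here is my step-by-step on Linux (if you have another OS, I suggest using a Linux Docker container; it will be much easier):
First step: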
apt-get install poppler-utils
Then I can run the command-line tool called pdfimages like this:
pdfimages -all myfile.pdf ./images_found/
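With the command above you can extract all the images contained in myfile.pdf and save them inside images_found (you have to create images_found beforehand).
In the output you will find several types of images: png, jpg, tiff. All of these are easily readable with any graphics tool.
Then you will have some files with names like -145.jb2e and -145.jb2g.
These two files contain one JBIG2-encoded image, saved as two separate files: one for the header and one for the data.
I lost many more days trying to find out how to convert those files into something readable, and finally I came across a tool called jbig2dec.
So first you need to install this magic tool: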
apt-get install jbig2dec
Then you can run:
jbig2dec -t png -145.jb2g -145.jb2e
You will finally be able to turn all the extracted images into something useful.
Good luck!
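With many images, the jbig2dec step can be scripted. The sketch below pairs each .jb2g file with the .jb2e file of the same stem and builds one jbig2dec call per pair, in the same argument order as the command above. The function names are hypothetical, the `-o` output option is assumed to be supported by your jbig2dec build (check `jbig2dec --help`), and .jb2e files without a matching .jb2g are simply skipped here.

```python
import os
import subprocess


def pair_jbig2_files(filenames):
    """Group pdfimages output into (globals_file, page_file) pairs by shared stem."""
    globals_by_stem = {}
    pages_by_stem = {}
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext == '.jb2g':
            globals_by_stem[stem] = name
        elif ext == '.jb2e':
            pages_by_stem[stem] = name
    return [(globals_by_stem[s], pages_by_stem[s])
            for s in sorted(pages_by_stem) if s in globals_by_stem]


def jbig2dec_command(globals_file, page_file):
    """Build the argv list; naming the output stem + '.png' is this sketch's choice."""
    out = os.path.splitext(page_file)[0] + '.png'
    return ['jbig2dec', '-t', 'png', '-o', out, globals_file, page_file]


def convert_all(directory):
    """Run jbig2dec on every .jb2g/.jb2e pair found in directory."""
    names = [os.path.join(directory, n) for n in os.listdir(directory)]
    for globals_file, page_file in pair_jbig2_files(names):
        subprocess.run(jbig2dec_command(globals_file, page_file), check=True)
```

Passing paths that include the directory (for example `images_found/-145.jb2g`) also avoids handing the tool a bare file name that starts with a dash, which could be mistaken for an option.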
小智 6
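I did this for my own program and found that the best library to use was PyMuPDF. It lets you find out the "xref" number of each image on each page and use it to extract the raw image data from the PDF.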
import fitz
from PIL import Image
import io
filePath = "path/to/file.pdf"
#opens doc using PyMuPDF
doc = fitz.Document(filePath)
#loads the first page
page = doc.loadPage(0)
#[First image on page described thru a list][First attribute on image list: xref n], check PyMuPDF docs under getImageList()
xref = page.getImageList()[0][0]
#gets the image as a dict, check docs under extractImage
baseImage = doc.extractImage(xref)
#gets the raw string image data from the dictionary and wraps it in a BytesIO object before using PIL to open it
image = Image.open(io.BytesIO(baseImage['image']))
#Displays image for good measure
image.show()
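Do check out the documentation, though.
A simpler solution:
Use the poppler-utils package. To install it, use Homebrew (Homebrew is macOS-specific, but you can find the poppler-utils package for Windows or Linux here: https://poppler.freedesktop.org/). The first line of code below installs poppler-utils using Homebrew. After installation, the second line (run from the command line) extracts the images from a PDF file and names them "image*". To run this program from within Python, use the os or subprocess module. The third and fourth lines are code using the os module, and below that is an example with subprocess (Python 3.5 or later for the run() function). More info here: https://www.cyberciti.biz/faq/easily-extract-images-from-pdf-file/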
brew install poppler
pdfimages file.pdf image
import os
os.system('pdfimages file.pdf image')
Or
import subprocess
subprocess.run('pdfimages file.pdf image', shell=True)
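When file names may contain spaces or other shell-special characters, passing the arguments as a list avoids shell quoting altogether. This is a small sketch under that assumption; `pdfimages_command` and `extract_images` are hypothetical names, and `-all` is the option used in the pdfimages answer further up.

```python
import shutil
import subprocess


def pdfimages_command(pdf_path, prefix, keep_native_formats=True):
    """Build an argv list for pdfimages; no shell quoting is needed this way."""
    cmd = ['pdfimages']
    if keep_native_formats:
        cmd.append('-all')  # same option as in the pdfimages answer above
    cmd += [pdf_path, prefix]
    return cmd


def extract_images(pdf_path, prefix):
    """Run pdfimages, failing early if the tool is not installed."""
    if shutil.which('pdfimages') is None:
        raise RuntimeError('pdfimages not found on PATH (install poppler-utils / poppler)')
    subprocess.run(pdfimages_command(pdf_path, prefix), check=True)
```

For example, `extract_images('my file.pdf', 'image')` would run the tool on a file whose name contains a space without any extra escaping.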
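After some searching I found the following script, which works really well with my PDFs. It only handles JPG, but it worked perfectly with my unprotected files. It also does not require any external libraries.
I take no credit for it: the script originates from Ned Batchelder, not me. Python 3 code: extract jpg's from pdf's. Quick and dirty.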
import sys

with open(sys.argv[1], "rb") as file:
    pdf = file.read()

startmark = b"\xff\xd8"  # JPEG start-of-image marker
startfix = 0
endmark = b"\xff\xd9"  # JPEG end-of-image marker
endfix = 2
i = 0

njpg = 0
while True:
    # Find the next PDF stream
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    # A JPEG stream starts with the SOI marker shortly after "stream"
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    # The EOI marker sits just before "endstream"
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend