How to trim (crop) bottom whitespace from a PDF document in memory

Question

Kac*_*ski 5 html python pdf wkhtmltopdf

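I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it with the correct height straight away (which I have not managed to do so far), or render it with the wrong height and then trim it. I am using Python.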

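Attempt type 1:

Render with wkhtmltopdf to a very, very long single-page PDF with a lot of extra space, using --page-height
Trim with pdfCropMargins: crop(["-p4", "100", "0", "100", "100", "-a4", "0", "-28", "0", "0", "input.pdf"])

The PDF is rendered perfectly, with a 28-unit margin at the bottom, but I had to use the file system to run the crop command. The tool seems to require an input file and an output file, and it also creates temporary files along the way. So I cannot use it.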

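Attempt type 2:

Render with wkhtmltopdf to a multi-page PDF using the default parameters
Read the file with PyPDF4 (or PyPDF2) and combine the pages into one long single page

In most cases the PDF renders fine. However, if the last PDF page happens to have very little content, there is sometimes a lot of extra whitespace at the bottom.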

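Ideal scenario:

The ideal scenario would involve a function that takes HTML and renders it to a single-page PDF with the expected amount of whitespace at the bottom. I would be happy to render the PDF with wkhtmltopdf, since it returns bytes, and then process those bytes to remove any extra whitespace. But I do not want to involve the file system, because I want to do everything in memory. Perhaps I could somehow inspect the PDF directly and remove the whitespace manually, or do some HTML magic to determine the rendered height beforehand?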

What I am doing now:

Note that pdfkit is a wkhtmltopdf wrapper.

# This is not a valid HTML (includes Django-specific stuff)
template: Template = get_template("some-django-template.html")

# This is now valid HTML
rendered = template.render({
    "foo": "bar",
})

# This first renders PDF from HTML normally (multiple pages)
# Then counts how many pages were created and determines the required single-page height
# Then renders a single-page PDF from HTML using the page height and width arguments
return pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})

This is equivalent to attempt type 2, except that instead of using PyPDF4 to stitch the pages together, I render again with wkhtmltopdf using the precomputed page height.

Answer 1

Nei*_*eil 1

There may be a better way to do this, but this at least works.

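I am assuming you are able to crop the PDF yourself, and all I am doing here is determining how far down the last page there is still content. If that assumption is wrong, I could probably work out how to crop the PDF. Alternatively, just crop the image (easy in Pillow) and then convert it to a PDF?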

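Also, if you have a large PDF, you may need to work out where the text ends within the whole PDF. I only determine how far down the last page the content ends. But converting from one to the other is a matter of simple arithmetic.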
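The arithmetic mentioned here could look roughly like the sketch below. It assumes that every page has the same height and that every page before the last one is completely full. The function name `content_height_mm` and its parameters are made up for illustration and do not come from any library.

```python
def content_height_mm(num_pages: int,
                      last_page_percent: float,
                      page_height_mm: float = 297.0) -> float:
    """Height of the content if all pages were stacked into one long page.

    Assumptions (not from the original post): all pages share the same
    height (A4 by default) and only the last page is partially filled.
    `last_page_percent` is how far down the last page the content ends,
    measured from the top, in percent.
    """
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    if not 0.0 <= last_page_percent <= 100.0:
        raise ValueError("last_page_percent must be between 0 and 100")
    full_pages = num_pages - 1
    return (full_pages + last_page_percent / 100.0) * page_height_mm


# Three A4 pages, content ends halfway down the last one.
print(content_height_mm(3, 50.0))  # 742.5
```

Dividing the result by `num_pages * page_height_mm` gives the same position as a fraction of the whole document instead of the last page.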

Test code:

import pdfkit
from PyPDF2 import PdfFileReader
from io import BytesIO

# This library isn't named fitz on pypi,
# obtain this library with `pip install PyMuPDF==1.19.4`
import fitz

# `pip install Pillow==8.3.1`
from PIL import Image

import numpy as np

# However you arrive at valid HTML, it makes no difference to the solution.
rendered = "<html><head></head><body><h3>Hello World</h3><p>hello</p></body></html>"

# This first renders PDF from HTML normally (multiple pages)
# Then counts how many pages were created and determines the required single-page height
# Then renders a single-page PDF from HTML using the page height and width arguments
pdf_bytes = pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})

# convert the pdf into an image.
pdf = fitz.open(stream=pdf_bytes, filetype="pdf")
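# Negative indices are supported, so this also works on PyMuPDF
# versions where the camelCase `pageCount` alias is no longer available.
last_page = pdf[-1]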
matrix = fitz.Matrix(1, 1)
image_pixels = last_page.get_pixmap(matrix=matrix, colorspace="GRAY")

image = Image.frombytes("L", [image_pixels.width, image_pixels.height], image_pixels.samples)

#Uncomment if you want to see.
#image.show()

# Now figure out where the end of the text is:

# First binarize. This might not be the most efficient way to do this.
# But it's how I do it.
THRESHOLD = 100
# I wrote this code ages ago and don't remember the details but
# basically, we treat every pixel > 100 as a white pixel, 
# We convert the result to a true/false matrix 
# And then invert that. 
# The upshot is that, at the end, a value of "True" 
# in the matrix will represent a black pixel in that location.
binary_matrix = np.logical_not(image.point( lambda p: 255 if p > THRESHOLD else 0 ).convert("1"))

# Scan upward from the bottom for the first row that contains a black pixel.
# `last_row` therefore ends up as the number of blank rows below the content.
row_count, column_count = binary_matrix.shape

last_row = 0
for i, row in enumerate(reversed(binary_matrix)):
    if row.any():
        last_row = i
        break

percentage_from_top = (1 - last_row / row_count) * 100
print(percentage_from_top)

# Now you know where the page ends.
# Go back and crop the PDF accordingly.

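For the final "crop the PDF accordingly" step, one possible in-memory approach is to set a crop box with PyMuPDF and serialize the document back to bytes. This is only a sketch under several assumptions: the page has no existing crop box, it is not rotated, and its MediaBox starts at the origin (which is what I would expect from wkhtmltopdf output, but I have not verified it). The helper names `crop_rect` and `crop_last_page` are hypothetical, and the default margin of 28 points simply mirrors the 28-unit margin from the question.

```python
from typing import Tuple


def crop_rect(page_width: float, page_height: float,
              percentage_from_top: float,
              bottom_margin: float = 28.0) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) of the area to keep, origin at the top-left.

    Units are PDF points. The new bottom edge sits `bottom_margin` points
    below the last content row, but never below the original page bottom.
    """
    content_bottom = page_height * percentage_from_top / 100.0
    new_bottom = min(page_height, content_bottom + bottom_margin)
    return (0.0, 0.0, page_width, new_bottom)


def crop_last_page(pdf_bytes: bytes, percentage_from_top: float,
                   bottom_margin: float = 28.0) -> bytes:
    """Crop the bottom whitespace of the last page without touching disk."""
    # PyMuPDF is imported here so that crop_rect() stays usable without it.
    import fitz

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page = doc[-1]
    # page.rect describes the visible page area with a top-left origin.
    x0, y0, x1, y1 = crop_rect(page.rect.width, page.rect.height,
                               percentage_from_top, bottom_margin)
    # Setting the CropBox hides everything outside the rectangle in viewers;
    # the hidden content is still stored in the file.
    page.set_cropbox(fitz.Rect(x0, y0, x1, y1))
    return doc.tobytes()
```

With the variables from the test code, the call would be `cropped = crop_last_page(pdf_bytes, percentage_from_top)`. Because the pixmap there is rendered with `fitz.Matrix(1, 1)`, one pixel row corresponds to one PDF point, so the percentage maps directly onto the page height.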

Archived:	4 years, 1 month ago
Views:	2198
Last recorded:	3 years, 7 months ago