Extracting page sizes from a PDF in Python

Question

13 python pdf

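I want to read a PDF and get a list of its pages and the size of each page. I don't need to manipulate it in any way, just read it.

I am currently trying pyPdf, and it does everything I need except provide a way to get page sizes. I understand that I will probably have to iterate over the pages, since page sizes can vary within a PDF document. Is there another library or method I can use?

I tried PIL; some online recipes even use d = Image(imagefilename), but it never reads any of my PDFs, although it reads everything else I throw at it, even some things I didn't know PIL could handle.

Any guidance is appreciated. I'm on Windows 7 64-bit with Python 2.5 (because I also do GAE work), but I'm happy to do this on Linux or with a more modern Python.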

Answer 1

Jos*_*Lee 27

This can be done with PyPDF2:

>>> from PyPDF2 import PdfFileReader
>>> input1 = PdfFileReader(open('example.pdf', 'rb'))
>>> input1.getPage(0).mediaBox
RectangleObject([0, 0, 612, 792])


(Formerly known as pyPdf, and it still refers to pyPdf's documentation.)

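@Astrophe is correct, but there is a more readable (though undocumented) solution. If you look at [the source](https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/generic.py#L854) you can see that the `RectangleObject` class has several convenience methods, including `getWidth()` (and `getHeight()`), which is **much** nicer than `mediaBox[2]` (5 upvotes)
[0, 0, width, height] (4 upvotes)
Usually, lengths are in points: 1 pt = 1/72 inch (2 upvotes)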
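
The comments above say that lengths are usually given in points, with 1 pt = 1/72 inch. As a minimal sketch of that arithmetic, the helpers below convert a length in points to inches and millimetres. The names `points_to_inches` and `points_to_mm` are hypothetical and belong to no PDF library; the sketch assumes the default unit (no UserUnit scaling) and a box that starts at the origin, so that the last two entries are the width and height.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def points_to_inches(points):
    """Convert a length in PDF points to inches (1 pt = 1/72 inch)."""
    return points / POINTS_PER_INCH


def points_to_mm(points):
    """Convert a length in PDF points to millimetres."""
    return points_to_inches(points) * MM_PER_INCH


# The media box shown in the answer above: [0, 0, 612, 792]
width_pt, height_pt = 612, 792

print(points_to_inches(width_pt), points_to_inches(height_pt))
# 8.5 11.0
print(round(points_to_mm(width_pt), 1), round(points_to_mm(height_pt), 1))
# 215.9 279.4
```

So the 612 x 792 box from the example corresponds to 8.5 x 11 inches, i.e. US Letter.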

Answer 2

Jam*_*ier 13

Using pdfrw:

>>> from pdfrw import PdfReader
>>> pdf = PdfReader('example.pdf')
>>> pdf.pages[0].MediaBox
['0', '0', '595.2756', '841.8898']


Lengths are given in points (1 pt = 1/72 inch). The format is ['0', '0', width, height] (thanks, Astrophe!).

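*"Lengths are given in points"*, unless the page contains a **UserUnit** entry, which can change the unit used here. Admittedly, this option is rarely used. (3 upvotes)
"The format is ['0', '0', width, height]": this is wrong. The format is [x0, y0, x1, y1]. It does not necessarily start at 0. (3 upvotes)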
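
The two comments above can be combined into one small calculation: treat the box as [x0, y0, x1, y1], subtract the corners, and scale by UserUnit if the page defines one. The following is only a sketch under those assumptions; `box_dimensions` is a hypothetical helper name, and the non-zero-origin box is made-up sample data rather than output from a real file.

```python
def box_dimensions(box, user_unit=1.0):
    """Return (width, height) of a box given as [x0, y0, x1, y1].

    Entries may be numbers or numeric strings (pdfrw returns strings in
    the answer above). `user_unit` is the page's UserUnit multiplier;
    1.0 is the default and means one unit equals 1/72 inch.
    """
    x0, y0, x1, y1 = (float(value) for value in box)
    width = abs(x1 - x0) * user_unit
    height = abs(y1 - y0) * user_unit
    return width, height


# Box starting at the origin: entries 2 and 3 happen to be width and height
print(box_dimensions(['0', '0', '595.2756', '841.8898']))
# (595.2756, 841.8898)

# Made-up box with a non-zero origin: entries 2 and 3 are NOT the size
print(box_dimensions([10, 20, 605, 812]))
# (595.0, 792.0)

# Same box on a page with UserUnit 2, i.e. each unit is 2/72 inch
print(box_dimensions([10, 20, 605, 812], user_unit=2.0))
# (1190.0, 1584.0)
```

With `user_unit` applied, the result is expressed in units of 1/72 inch regardless of the page's own unit.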

Answer 3

Myo*_*aiz 7

For pdfminer on Python 3.x (pdfminer.six); I have not tried this on Python 2.7:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage

pdfPath = 'example.pdf'  # placeholder: path to your PDF file
pageSizesList = []
with open(pdfPath, 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    for page in PDFPage.create_pages(doc):
        print(page.mediabox)  # the media box, i.e. the page size, as four numbers [x0, y0, x1, y1]
        pageSizesList.append(page.mediabox)  # pageSizesList ends up holding one box per page

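Since the question mentions that page sizes can vary within one document, a list of boxes like the one collected above can be condensed into a count per distinct size. This is a sketch only: `summarize_page_sizes` is a hypothetical helper, the rounding to one decimal place is an arbitrary choice, and `example_boxes` is stand-in data rather than output from a real PDF.

```python
from collections import Counter


def summarize_page_sizes(mediaboxes, ndigits=1):
    """Count how many pages share each (width, height) size.

    `mediaboxes` is any iterable of [x0, y0, x1, y1] boxes. Sizes are
    rounded to `ndigits` decimals so that tiny float differences do not
    split otherwise identical pages into separate groups.
    """
    sizes = Counter()
    for x0, y0, x1, y1 in mediaboxes:
        width = round(abs(float(x1) - float(x0)), ndigits)
        height = round(abs(float(y1) - float(y0)), ndigits)
        sizes[(width, height)] += 1
    return sizes


# Stand-in data: two US Letter pages and one A4 page
example_boxes = [
    [0, 0, 612, 792],
    [0, 0, 612, 792],
    [0, 0, 595.2756, 841.8898],
]

for (width, height), count in summarize_page_sizes(example_boxes).items():
    print(f"{count} page(s) of {width} x {height} pt")
```

Passing a list of media boxes gathered from a real document instead of `example_boxes` should give one line per distinct page size.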

Answer 4

mar*_*004 7

Using pikepdf:

import pikepdf

# open the file and select the first page
pdf = pikepdf.Pdf.open("/path/to/file.pdf")
page = pdf.pages[0]

if '/CropBox' in page:
    # use CropBox if defined since that's what the PDF viewer would usually display
    relevant_box = page.CropBox
elif '/MediaBox' in page:
    relevant_box = page.MediaBox
else:
    # fall back to ANSI A (US Letter) if neither CropBox nor MediaBox are defined
    # unlikely, but possible
    relevant_box = [0, 0, 612, 792]

# actually there could also be a viewer preference ViewArea or ViewClip in
# pdf.Root.ViewerPreferences defining which box to use, but most PDF readers 
# disregard this option anyway

# check whether the page defines a UserUnit
userunit = 1
if '/UserUnit' in page:
    userunit = float(page.UserUnit)

# convert the box coordinates to float and multiply with the UserUnit
relevant_box = [float(x)*userunit for x in relevant_box]

# obtain the dimensions of the box
width  = abs(relevant_box[2] - relevant_box[0])
height = abs(relevant_box[3] - relevant_box[1])

rotation = 0
if '/Rotate' in page:
    rotation = page.Rotate

# if the page is rotated clockwise or counter-clockwise, swap width and height
# (pdf rotation modifies the coordinate system, so the box always refers to 
# the non-rotated page)
if (rotation // 90) % 2 != 0:
    width, height = height, width

# now you have width and height in points
# 1 point is equivalent to 1/72in (1in -> 2.54cm)

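The rotation step in the code above can be isolated into a small function that is easy to check on its own. This is a sketch, not part of pikepdf: `displayed_size` is a hypothetical name, and it assumes the /Rotate value is a multiple of 90 degrees, as the PDF format requires. Negative values and values above 360 are reduced to quarter turns first.

```python
def displayed_size(width, height, rotate=0):
    """Return (width, height) as displayed for a given /Rotate value.

    The boxes always describe the unrotated page, so width and height
    are swapped for an odd number of quarter turns.
    """
    rotate = int(rotate)
    if rotate % 90 != 0:
        raise ValueError("rotate must be a multiple of 90 degrees")
    quarter_turns = (rotate // 90) % 4
    if quarter_turns % 2 == 1:
        return height, width
    return width, height


print(displayed_size(612, 792, 0))    # (612, 792)
print(displayed_size(612, 792, 90))   # (792, 612)
print(displayed_size(612, 792, 180))  # (612, 792)
print(displayed_size(612, 792, -90))  # (792, 612)
```

Raising an error for values such as 45 is a design choice of this sketch; a more lenient version could ignore invalid rotations instead.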

Answer 5

cge*_*901 6

Update 2021-07-22: the original answer was not always correct, so I have updated it.

Using PyMuPDF:

>>> import fitz
>>> doc = fitz.open("example.pdf")
>>> page = doc.loadPage(0)
>>> print(page.rect.width, page.rect.height)
842.0 595.0
>>> print(page.mediabox.width, page.mediabox.height)
595.0 842.0

Run Code Online (Sandbox Code Playgroud)

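Both mediabox and rect return a Rect, which has the attributes "width" and "height". One difference between them is that mediabox is the same as the /MediaBox in the document and does not change when the page is rotated, whereas rect is affected by rotation. For more information about the different boxes in PyMuPDF, see the glossary.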

Instead of "doc.loadPage(0)" you can also simply write "doc[0]" :-) (2 upvotes)
This is the fastest wrapper library for reading PDF files (2 upvotes)
