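Tag: pypdf2

How do I rotate a page in PyPDF2?

I'm editing PDF files with PyPDF2. I've managed to generate the PDF I want, but I still haven't been able to rotate some of the pages.

I went to the documentation and found two methods, rotateClockwise and rotateCounterClockwise. The docs say the parameter is an int, but I can't get it to work. Python says: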

TypeError: unsupported operand type(s) for +: 'IndirectObject' and 'int'


The code that produces this error:

page = input1.getPage(i)
page.rotateCounterClockwise(90)
output.addPage(page)


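I can't find anyone who explains the procedure. There is a related question on Stack Overflow, but its answer is vague.

Thanks in advance. Sorry if I missed something.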
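
The TypeError suggests that the page's /Rotate entry is stored as an indirect reference, which cannot be added to an int. Below is a minimal sketch of one workaround, assuming PyPDF2 1.x, where indirect references expose getObject(). The helper names resolved_rotation and rotated_angle are hypothetical, and the functions only need a dict-like page:

```python
def resolved_rotation(page):
    """Return the page's current /Rotate value as a plain int (0 if absent)."""
    value = page.get("/Rotate", 0)
    # Assumption: an indirect reference exposes getObject(), as in PyPDF2 1.x.
    if hasattr(value, "getObject"):
        value = value.getObject()
    return int(value)


def rotated_angle(page, degrees_clockwise):
    """Compute the new /Rotate value; pass a negative angle for counter-clockwise."""
    return (resolved_rotation(page) + degrees_clockwise) % 360
```

With a real PyPDF2 1.x page, the computed angle could then be written back with page[NameObject("/Rotate")] = NumberObject(angle) (both classes live in PyPDF2.generic) before calling output.addPage(page).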

python pypdf2

Ign*_*chi

2017-05-23

7
Votes

1
Answers

4261
Views

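Xref table not zero-indexed. ID numbers for objects will be corrected. Won't continue

I'm trying to open a PDF file to get its page count. I'm using PyPDF2.

Here is my code: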

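import PyPDF2

def pdfPageReader(fileName):
    try:
        # The with block closes the file even if reading fails.
        with open(fileName, 'rb') as pdf_file:
            read_pdf = PyPDF2.PdfFileReader(pdf_file, strict=True)
            number_of_pages = read_pdf.getNumPages()
        print(str(fileName) + " = " + str(number_of_pages))
        return number_of_pages
    except Exception:
        # Fall back to "1" (a string, as in the original) on any read error.
        return "1"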


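But then I ran into this error:

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]

I tried both strict=True and strict=False. With True, it shows this message and then nothing happens; I waited 30 minutes and still nothing. With False, it shows nothing at all and likewise does nothing. If I press Ctrl+C in the terminal (cmd, Windows 10), it cancels opening that file and continues (I'm running over a batch of PDF files). Only one file in the batch has this problem.

My question is: how can I fix this, or how can I skip it, or how can I cancel it and carry on with the other PDF files?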
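
One way to skip a file that hangs is to count its pages in a child process and give up after a timeout. This is only a sketch under stated assumptions: the function name count_pages_with_timeout, the 30 second default and the child snippet are hypothetical, the child uses the same PyPDF2 1.x calls as the question, and it does not repair the problematic file.

```python
import subprocess
import sys

# Child script: prints the page count of the PDF named in argv[1].
PAGE_COUNT_SNIPPET = (
    "import sys, PyPDF2\n"
    "with open(sys.argv[1], 'rb') as fh:\n"
    "    print(PyPDF2.PdfFileReader(fh, strict=False).getNumPages())\n"
)


def count_pages_with_timeout(path, timeout_s=30, snippet=PAGE_COUNT_SNIPPET):
    """Return the page count, or None if the child fails or exceeds the timeout."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", snippet, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if done.returncode != 0:
        return None
    try:
        return int(done.stdout.strip())
    except ValueError:
        return None
```

In a batch loop, a None result can be logged and the loop can move on to the next file.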

python-3.x pypdf2

JBi*_*Bin

2019-04-15

7
Votes

3
Answers

5785
Views

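Extract specific pages of a PDF and save them with Python

I have some source files and am trying to write code that extracts certain pages and creates new PDF files. I have a list that looks like this: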

information = [(filename1,startpage1,endpage1), (filename2, startpage2, endpage2), ...,(filename19,startpage19,endpage19)].


Here is my code.

import PyPDF2    
for page in range(pdfReader.getNumPages()-1):
    pdf_writer = PyPDF2.PdfFileWriter()
    start = information[page][1]
    end = information[page][2]
    while start<end:
        pdf_writer.addPage(pdfReader.getPage(start))
        start+=1
        output_filename = '{}_{}_page_{}.pdf'.format(information[page][0],information[page][1], information[page][2])
    with open(output_filename,'wb') as out:
        pdf_writer.write(out)


But the output is weird... some files have nothing in them and some contain only one page. How can I correct this?
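
A few things in the loop look suspicious: it iterates over the page count of a single reader rather than over the entries of information, start < end leaves out the end page, and output_filename is only set inside the while loop. The sketch below shows one possible restructuring. It assumes that page numbers in information are 1-based and inclusive and that each tuple names its own source file; the helper names are hypothetical.

```python
def page_indices(start_page, end_page, num_pages):
    """Turn a 1-based inclusive page range into 0-based indices, clipped to the file."""
    first = max(start_page, 1)
    last = min(end_page, num_pages)
    return list(range(first - 1, last))


def output_name(filename, start_page, end_page):
    """Build the output name in the same pattern the question uses."""
    return "{}_{}_page_{}.pdf".format(filename, start_page, end_page)


def extract_ranges(information):
    """Write one output PDF per (filename, start_page, end_page) entry."""
    import PyPDF2  # imported here so the helpers above work without PyPDF2

    for filename, start_page, end_page in information:
        with open(filename, "rb") as src:
            reader = PyPDF2.PdfFileReader(src)
            writer = PyPDF2.PdfFileWriter()
            for index in page_indices(start_page, end_page, reader.getNumPages()):
                writer.addPage(reader.getPage(index))
            # Write while the source file is still open; pages are read lazily.
            with open(output_name(filename, start_page, end_page), "wb") as out:
                writer.write(out)
```

If the page numbers in information are already 0-based, the "- 1" in page_indices would need to be dropped.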

python pdf extract pypdf2

SSS*_*SSS

lucky-day

7
Votes

2
Answers

8716
Views

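PyPDF2: Concatenating PDFs in memory

I want to efficiently concatenate (append) a bunch of small PDFs in memory in pure Python. Specifically, the usual case is merging 500 single-page PDFs, each about 400 kB, into one. Assume the PDFs are available as an in-memory iterable, e.g. a list: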

my_pdfs = [pdf1_fileobj, pdf2_fileobj, ..., pdfn_fileobj]  # type is BytesIO


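where each pdf_fileobj is of type BytesIO. The base memory usage is then about 200 MB (500 PDFs at 400 kB each).

Ideally, I would like the code below to use no more than 400-500 MB of memory in total (including my_pdfs) for the concatenation. That does not seem to be the case, though: the debug statement on the last line reports a peak memory usage of nearly 700 MB. In addition, the Mac OS X resource monitor shows 600 MB allocated when the last line is reached.

Running gc.collect() reduces this to 350 MB (almost too good?). Why do I have to run garbage collection manually here to get rid of the merge garbage? I have seen this (possibly) cause memory build-up in a slightly different scenario, which I'll skip for now.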

import PyPDF2
import io
import resource  # For debugging

def merge_pdfs(iterable):
    ''' Merge pdfs in memory '''
    merger = PyPDF2.PdfFileMerger()
    for pdf_fileobj in iterable:
        merger.append(pdf_fileobj)

    myio = io.BytesIO()
    merger.write(myio)
    merger.close()

    myio.seek(0)
    return myio

my_concatenated_pdf = merge_pdfs(my_pdfs)

# Print the maximum memory usage
print('Memory usage: %s (kB)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)


Summary of questions

Why does the code above need nearly 700 MB of memory to merge 200 MB of PDF files? 400 MB …
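
One possible explanation for gc.collect() freeing so much, not confirmed by the question, is that the merged objects reference each other in cycles. In CPython, reference counting alone cannot free such cycles; they stay alive until the cyclic garbage collector runs. The standalone sketch below illustrates that mechanism with a hypothetical Node class and does not involve PyPDF2 at all:

```python
import gc
import weakref


class Node:
    """Tiny object that can point at another Node, so two of them can form a cycle."""

    def __init__(self, payload):
        self.payload = payload
        self.other = None


def make_cycle():
    """Create two Nodes that reference each other and return a weak reference to one."""
    a = Node(bytearray(1024))
    b = Node(bytearray(1024))
    a.other = b
    b.other = a
    return weakref.ref(a)


gc.disable()                       # keep the automatic collector out of the way
ref = make_cycle()
alive_before = ref() is not None   # the cycle survives after the locals are gone
gc.collect()                       # an explicit collection breaks the cycle
alive_after = ref() is not None
gc.enable()
```

Here alive_before is True and alive_after is False, which mirrors the drop in memory seen after calling gc.collect() by hand.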

python memory pdf pypdf2

And*_*eas

lucky-day

6
Votes

1
Answers

904
Views

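Reading a PDF in Python directly as it appears in the PDF

If I use the code from the answer to "Extracting text from a PDF file using PDFMiner in Python?"

I can extract the text when applying it to this PDF: https://www.tencent.com/en-us/articles/15000691526464720.pdf

However, as you can see under "Consolidated Income Statement", it reads the labels first, i.e. Revenues VAS Online advertising, and only then reads the numbers. I want it to read, i.e.:

Revenues 73,528 49,552 73,528 66,392 VAS 46,877 35,108 and so on. Is there a way to do this?

Looking for other possible solutions than pdfminer.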
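
Reading a table row by row generally needs position information, not just the text stream. As a rough sketch, assuming a layout-aware extractor can supply each text fragment together with its x and y coordinates (how to obtain them is left open here), fragments that share roughly the same y can be grouped into one visual line. The function name rows_from_fragments and the tolerance value are hypothetical:

```python
def rows_from_fragments(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into visual lines, top to bottom, left to right."""
    rows = []  # each entry: (y of the first fragment in the row, [(x, text), ...])
    # PDF y coordinates grow upwards, so sort by descending y to start at the top.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [" ".join(text for _, text in sorted(cells)) for _, cells in rows]
```

For fragments such as (50, 700, "Revenues"), (200, 700.5, "73,528") and (300, 699.8, "49,552"), this returns the single line "Revenues 73,528 49,552".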

If I try PyPDF2 with this code, not all of the text shows up:

# importing required modules
import PyPDF2

# creating a pdf file object
pdfFileObj = open(file, 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# printing number of pages in pdf file
a=(pdfReader.numPages)

# creating a page object
for i in range(0,a):
    pageObj = pdfReader.getPage(i)
    print(pageObj.extractText())


python pdf pdfminer pypdf2

jas*_*son

2018-07-25

6
Votes

1
Answers

660
Views

Removing a watermark from a PDF with PyPDF2

This section imports the necessary classes from the PyPDF2 library

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.pdf import ContentStream
from PyPDF2.generic import TextStringObject, NameObject
from PyPDF2.utils import b_

# The watermark says SAMPLE on it so I've tried different capitalization cases
wm_text = 'Sample'
replace_with = ''
# I'm hoping to just replace the SAMPLE watermark with nothing so a space could suffice

# Load PDF into pyPDF
source = PdfFileReader(open('input.pdf', "rb"))
output = PdfFileWriter()

# For each page
for page in range(source.getNumPages()):
    # Get the current page and …
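
The rest of the question's code is cut off, so the following is only a generic sketch of the idea, not the asker's method. It assumes the page content has been parsed into a list of (operands, operator) pairs, which is how PyPDF2 1.x's ContentStream exposes its operations, and that the watermark is drawn with a plain Tj text operator. Watermarks drawn as images, paths or multi-part TJ arrays would not be affected. The function name strip_text_ops is hypothetical.

```python
def strip_text_ops(operations, unwanted_text):
    """Return the operations with Tj calls that draw unwanted_text removed."""
    kept = []
    for operands, operator in operations:
        is_show_text = operator in (b"Tj", "Tj")
        if is_show_text and operands:
            # Compare case-insensitively, since the capitalization is uncertain.
            if str(operands[0]).lower() == unwanted_text.lower():
                continue
        kept.append((operands, operator))
    return kept
```

The filtered list would then have to be assigned back to the content stream and the page written out through a PdfFileWriter.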


python pdf watermark pypdf2

Sha*_* G.

lucky-day

5
Votes

1
Answers

6617
Views

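Is there a way to close the file that PdfFileReader opens?

I open a lot of PDFs and want to delete them after parsing, but the files stay open until the program finishes running. How do I close a PDF opened with PyPDF2?

Code: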

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = PyPDF2.PdfFileReader(file(path, "rb"))

    #Check for number of pages, prevents out of bounds errors
    max = 0
    if pdf.numPages > 3:
        max = 3
    else:
        max = (pdf.numPages - 1)

    # Iterate pages
    for i in range(0, max): 
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    #pdf.close()
    return …
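
PdfFileReader reads from whatever file object it is given, so one option is to open the file yourself and close it before deleting. A minimal sketch, assuming Python 3 and PyPDF2 1.x; the names parse_then_delete and first_pages_text are hypothetical, and the page limit of 3 follows the question:

```python
import os


def parse_then_delete(path, parse):
    """Run parse(file_object) on the file, close it, then delete the file."""
    with open(path, "rb") as fh:
        result = parse(fh)   # everything needed must be read inside the with block
    os.remove(path)          # the handle is closed by now, so deletion can proceed
    return result


def first_pages_text(fh, limit=3):
    """Extract text from at most `limit` leading pages of an open PDF file."""
    import PyPDF2  # imported here so parse_then_delete works without PyPDF2

    pdf = PyPDF2.PdfFileReader(fh)
    count = min(pdf.numPages, limit)
    return "\n".join(pdf.getPage(i).extractText() for i in range(count))
```

A call would look like parse_then_delete(path, first_pages_text). Using min(pdf.numPages, limit) also avoids skipping the last page of short documents, which numPages - 1 does.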


python python-2.7 pypdf2

SPY*_*G96

2017-10-31

5
Votes

1
Answers

7138
Views

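Editing text in a PDF using Python

I have a PDF file and need to edit some text/values in it. For example, my PDFs contain "BIRTHDAY DD/MM/YYYY", which is always "N/A". I want to change it to whatever value I need and then save the result as a new document. Overwriting the existing document is fine too.

This is what I have done so far: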

import PyPDF2
pdf_obj = open('abc.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_obj)
page = pdf_reader.getPage(0)

writer = PyPDF2.PdfFileWriter()
writer.addPage(pdf_reader.getPage(0))
pdf_doc = writer.updatePageFormFieldValues(pdf_reader.getPage(0), {'BIRTHDAY DD/MM/YYYY': '123'})
outfp = open("new_abc1.pdf", 'wb')
writer.write(outfp)
outfp.close()


However, updatePageFormFieldValues() does not change the desired value, perhaps because this is not a form field?
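
A quick way to test that suspicion is to list the form fields the document actually contains. A minimal sketch, assuming PyPDF2 1.x, where PdfFileReader.getFields() returns a dict of fields or None when the document has no interactive form; the helper name form_field_names is hypothetical:

```python
def form_field_names(reader):
    """Return the sorted names of the document's form fields (empty if none)."""
    fields = reader.getFields()  # None when the PDF has no interactive form
    return sorted(fields) if fields else []
```

If 'BIRTHDAY DD/MM/YYYY' is not in the returned list, the text is most likely ordinary page content rather than a form field, and updatePageFormFieldValues() has nothing to update.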

Screenshot of the PDF showing the value to change

Any clues?

python data-analysis python-2.7 pypdf2

roo*_*kit

2018-06-11

5
Votes

1
Answers

1441
Views

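Extracting vertical text from a scanned PDF with PyPDF2

I'm trying to extract text from scanned PDFs using PyPDF2. Some of the PDFs contain vertically aligned text, even though the page orientation is portrait. Is there a way, using pdfminer or PyPDF2, to detect whether text is vertically aligned and to read the vertical lines in the PDF?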

python python-3.x pdf-extraction pdfminer pypdf2

Mms*_*Mms

2018-09-27

5
Votes

1
Answers

407
Views

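PyPDF2: Copying a PDF produces blank pages

I'm using PyPDF2 to alter a PDF document (adding bookmarks). So I need to read in the entire source PDF and write it out again, keeping as much of the data intact as possible. Merely writing each page into a new PDF object may not be enough to preserve the document metadata.

PdfFileWriter() does have several methods for copying a whole file: cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot. However, they all have problems.

If I use cloneDocumentFromReader or appendPagesFromReader, I get a valid PDF file with the correct number of pages, but all of the pages are blank.

If I use cloneReaderDocumentRoot, I get a minimal valid PDF file, but with no pages or data.

This has been asked before, without a successful answer. There are other questions about blank pages in PyPDF2, but I couldn't apply the answers given.

Here is my code: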

def bookmark(incomingFile):
    fileObj = open(incomingFile, 'rb')
    output = PdfFileWriter()
    input = PdfFileReader(fileObj)

    output.appendPagesFromReader(input)
    #output.cloneDocumentFromReader(input)
    myTableOfContents = [
            ('Page 1', 0), 
            ('Page 2', 1),
            ('Page 3', 2)
            ]
    # output.addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit')
    for title, pagenum in myTableOfContents:
        output.addBookmark(title, pagenum, parent=None) …
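
As a point of comparison, here is a sketch of the plain page-by-page route with the document info copied over explicitly. It assumes PyPDF2 1.x method names (getNumPages, getPage, getDocumentInfo, addPage, addMetadata, addBookmark); the function name copy_with_bookmarks is hypothetical, and nothing in the question confirms that this avoids the blank pages.

```python
def copy_with_bookmarks(reader, writer, table_of_contents):
    """Copy all pages and plain-string metadata, then add (title, page) bookmarks."""
    for index in range(reader.getNumPages()):
        writer.addPage(reader.getPage(index))

    info = reader.getDocumentInfo()
    if info:
        # Copy only plain string values; other value types are skipped as a precaution.
        plain = {key: value for key, value in info.items() if isinstance(value, str)}
        if plain:
            writer.addMetadata(plain)

    for title, page_number in table_of_contents:
        writer.addBookmark(title, page_number, parent=None)
    return writer
```

Page content is read lazily from the source file, so the source file object should stay open until writer.write() has finished.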


python pdf pypdf2

ben*_*ggy

2019-04-22

5
Votes

1
Answers

1242
Views