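I am writing mail-merge software as part of a Python web application.
I have a template called letter.pdf, generated from an MS Word file, which contains the text {name} where the resident's name should go. I also have a list of about 100 residents' names.
What I want to do is read in letter.pdf, search for "{name}", replace it with the resident's name (for each resident), and write the result to another PDF. I then want to combine all of those PDFs into one big PDF (one page per letter), which the users of my web app will print out to produce their letters.
Are there any Python libraries that can do this? I have looked at pdfrw and pdfminer, but I can't see where they would be able to do it.
(Note: I also have the MS Word file, so if there is another way that uses it instead of going through a PDF, that would do the job too.)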
Dmy*_*tro 11
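This can be done with the PyPDF2 package. The implementation may depend on the structure of the original PDF template. However, if the template is stable and does not change often, the replacement code does not need to be generic and can be fairly simple.
I made a small sketch of how to replace text in a PDF file. It replaces every occurrence of the token PDF with DOC.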
# Written against the PyPDF2 1.x API (PdfFileReader, getPage, addPage, ...);
# these deprecated names were removed in PyPDF2 3.0.
import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject


def replace_text(content, replacements=None):
    """Apply the replacements to Tj/TJ lines inside BT ... ET text blocks."""
    replacements = replacements or {}
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            # Only lines ending in a text-showing operator are rewritten.
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(stream, replacements):
    """Decode one content stream, replace the text, and store it back."""
    data = stream.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if stream.decodedSelf is not None:
        stream.decodedSelf.setData(encoded_data)
    else:
        stream.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = {'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):
        page = pdf.getPage(page_number)
        contents = page.getContents()

        # A page's contents are either a single stream or an array of streams.
        if isinstance(contents, (DecodedStreamObject, EncodedStreamObject)):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, (DecodedStreamObject, EncodedStreamObject)):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)
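For anyone wondering what the BT/ET and Tj checks are doing: in a page content stream, BT and ET delimit a text object, and Tj / TJ are the operators that actually show text. Below is a minimal, self-contained sketch of that filtering, run on a hand-written content stream. The sample stream and the name replace_in_text_blocks are assumptions made for illustration. It is a compact variant of replace_text, not the exact function above, and it needs no PDF library.

```python
def replace_in_text_blocks(content, replacements):
    """Apply replacements only to Tj/TJ lines between BT and ET."""
    out = []
    in_text = False
    for line in content.splitlines():
        if line == "BT":
            in_text = True
        elif line == "ET":
            in_text = False
        elif in_text and line[-2:].lower() == "tj":
            # Text-showing operator: this is where visible text lives.
            for old, new in replacements.items():
                line = line.replace(old, new)
        out.append(line)
    return "\n".join(out) + "\n"


# Hand-written sample stream (an assumption, not taken from a real file).
sample = "\n".join([
    "q",
    "BT",
    "/F1 12 Tf",
    "72 720 Td",
    "(Dear {name},) Tj",
    "ET",
    "Q",
]) + "\n"

merged = replace_in_text_blocks(sample, {"{name}": "Alice"})
print(merged)
```

Lines outside a BT ... ET block are passed through untouched, so a string that happens to look like the placeholder elsewhere in the stream is left alone.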
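The result is a copy of the input PDF, saved next to it with a .result.pdf suffix, in which the text has been replaced.
Update, 21 March 2021:
The code sample was updated to handle DecodedStreamObject and EncodedStreamObject, which are the objects that actually hold the data stream containing the text to update.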
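To tie this back to the mail-merge case in the question, here is a rough sketch of building one replacements dict per resident. It assumes the placeholder is {name} as in the question. The helper names escape_pdf_literal and replacements_for, and the sample names, are hypothetical. The escaping matters because the name ends up inside a PDF literal string, where backslashes and parentheses are special.

```python
def escape_pdf_literal(text):
    """Escape backslash and parentheses for use inside a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def replacements_for(resident_name, placeholder="{name}"):
    """Build the replacements dict for one resident (hypothetical helper)."""
    return {placeholder: escape_pdf_literal(resident_name)}


# Hypothetical sample data; the real list would come from the web app.
residents = ["Alice Smith", "Bob (Flat 2) Jones"]
per_letter = [replacements_for(name) for name in residents]

for name, mapping in zip(residents, per_letter):
    print(name, "->", mapping)
```

Each dict would then be passed to process_data for a freshly read copy of the template page, with every resulting page added to the same writer so that the output is one PDF with one letter per page. Note that fonts in Word-generated PDFs are often embedded as subsets, so a name containing characters that never appear in the template may not render.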
Vla*_*ior 10
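The updated code sample in Dmy*_*tro's answer, which handles DecodedStreamObject and EncodedStreamObject (the objects that actually contain the data stream with the text to update), runs without errors, but when I used a file other than the example one, the PDF text content was not changed.
Following EDIT 3 of "How to replace text in a PDF using Python?":
Inserting page[NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page) forces PyPDF2 to update the content of the page object.
That got me past the problem, and I was able to replace the text in the PDF file.
The final code should look like this: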
import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject, NameObject


def replace_text(content, replacements=None):
    """Apply the replacements to Tj/TJ lines inside BT ... ET text blocks."""
    replacements = replacements or {}
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(stream, replacements):
    """Decode one content stream, replace the text, and store it back."""
    data = stream.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if stream.decodedSelf is not None:
        stream.decodedSelf.setData(encoded_data)
    else:
        stream.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = {'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):
        page = pdf.getPage(page_number)
        contents = page.getContents()

        if isinstance(contents, (DecodedStreamObject, EncodedStreamObject)):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, (DecodedStreamObject, EncodedStreamObject)):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

        # Force content replacement. The guard is an addition to the original
        # line: an array of streams has no decodedSelf, and a stream that was
        # never encoded has decodedSelf set to None.
        decoded = getattr(contents, "decodedSelf", None)
        if decoded is not None:
            page[NameObject("/Contents")] = decoded
        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)
Important: the extra import this needs is from PyPDF2.generic import NameObject
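Another reason a replacement can silently do nothing on a different file: replace_text only matches a placeholder that appears literally on a single Tj/TJ line. PDF producers may instead split the text across a TJ array with kerning numbers in between, or write it as hex strings. The sketch below is a small diagnostic for the first case. It is a simplification (it ignores escaped parentheses and hex strings), and the function names and the sample stream are assumptions for illustration.

```python
import re


def visible_text(line):
    """Join the literal strings on one Tj/TJ line, ignoring kerning numbers.

    Simplification: escaped parentheses and hex strings are not handled.
    """
    return "".join(re.findall(r"\(([^()]*)\)", line))


def find_placeholder(content, placeholder):
    """Return (literal, split): 1-based line numbers where the placeholder
    appears as-is, and where it only appears after joining a TJ array."""
    literal, split = [], []
    for number, line in enumerate(content.splitlines(), start=1):
        if line[-2:].lower() != "tj":
            continue
        if placeholder in line:
            literal.append(number)
        elif placeholder in visible_text(line):
            split.append(number)
    return literal, split


# Hand-written sample stream (an assumption, not taken from a real file).
sample = "\n".join([
    "BT",
    "(Dear {name},) Tj",
    "[(Dear {na) -3 (me},)] TJ",
    "ET",
])
literal, split = find_placeholder(sample, "{name}")
print("literal:", literal, "split:", split)
```

If the placeholder only shows up in the split list, a plain str.replace on the line will not find it, and the line has to be rebuilt rather than patched.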
pdftk original.pdf output uncompressed.pdf uncompress
from PyPDF2 import PdfFileReader, PdfFileWriter

replacements = [
    ("old string", "new string")
]

pdf = PdfFileReader(open("uncompressed.pdf", "rb"))
writer = PdfFileWriter()

for page in pdf.pages:
    contents = page.getContents().getData()
    for (a, b) in replacements:
        contents = contents.replace(a.encode('utf-8'), b.encode('utf-8'))
    page.getContents().setData(contents)
    writer.addPage(page)

with open("modified.pdf", "wb") as f:
    writer.write(f)
pdftk modified.pdf output recompressed.pdf compress
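The uncompress and compress steps matter because page content streams are usually stored with the FlateDecode filter, which is zlib/deflate compression, so the text cannot be found by searching the raw bytes. Here is a small stdlib-only sketch of that effect. The sample stream is made up, and it stands in for one page's content rather than a whole PDF file.

```python
import zlib

# Made-up content stream standing in for one page's contents.
content = b"BT\n/F1 12 Tf\n72 720 Td\n(Dear {name},) Tj\nET\n"
compressed = zlib.compress(content)

# The placeholder is not visible in the compressed bytes ...
found_compressed = b"{name}" in compressed
# ... but it is once the stream has been inflated.
found_plain = b"{name}" in zlib.decompress(compressed)

# Replace on the inflated bytes, then deflate again.
edited = zlib.decompress(compressed).replace(b"{name}", b"Alice")
recompressed = zlib.compress(edited)

print(found_compressed, found_plain, len(compressed), len(recompressed))
```

Running pdftk with uncompress does the inflate step for every stream in the file, which is why the plain bytes.replace in the script above can find the old string at all.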