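Does anyone know a way to vectorize the text in a PDF document? That is, I want every letter to be a shape/outline, with no text content at all. I am on Linux, and open-source or non-Windows solutions are preferred.
Context: I am trying to edit some old PDFs for which I no longer have the fonts. I would like to do this in Inkscape, but it replaces all the fonts with generic ones, and the result is barely readable. I have also been converting back and forth with pdf2ps and ps2pdf, but the font information stays in the file, so it still looks terrible when I load it into Inkscape.
Any ideas? Thanks.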
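To achieve this, you will have to:
1. Split the PDF into individual pages;
2. Convert the PDF pages into SVG;
3. Edit the pages you want;
4. Reassemble the pages.
This answer will omit step 3, since that cannot be done programmatically.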
If you do not want to split the document programmatically, the modern way is to use stapler. In your favorite shell:
stapler burst file.pdf
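will generate {file_1.pdf,...,file_N.pdf}, where 1...N are the PDF pages. Stapler itself uses PyPDF2, and the code for splitting a PDF file is not complicated. The following function splits a file and saves the individual pages in the current directory. (Shamelessly copied from the commands.py file.)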
import os

# Legacy PyPDF2 (< 3.0) class and method names, as in stapler's commands.py
from PyPDF2 import PdfFileWriter, PdfFileReader


def split(filename):
    # PDF is a binary format, so the file must be opened in 'rb' mode
    with open(filename, 'rb') as inputfp:
        inputpdf = PdfFileReader(inputfp)
        numpages = inputpdf.getNumPages()

        base, ext = os.path.splitext(os.path.basename(filename))

        # Prefix the output template with zeros so that ordering is preserved
        # (page 10 after page 09). The width is the digit count of the page
        # count; ceil(log10(n)) is one digit short when n is 10, 100, ...
        output_template = ''.join([
            base,
            '_',
            '%0',
            str(len(str(numpages))),
            'd',
            ext
        ])

        for page in range(numpages):
            outputpdf = PdfFileWriter()
            outputpdf.addPage(inputpdf.getPage(page))

            outputname = output_template % (page + 1)

            with open(outputname, 'wb') as fp:
                outputpdf.write(fp)
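The zero-padding in the output template is easy to get subtly wrong when the page count is an exact power of ten. As a minimal sketch (the helper name `page_filename` is made up for illustration and is not part of stapler or PyPDF2), the padding width can be taken from the number of digits in the page count:

```python
def page_filename(base, page, total, ext='.pdf'):
    """Return a zero-padded name such as 'file_07.pdf' for a 1-based page."""
    # The widest page number is `total`, so pad every number to its length
    width = len(str(total))
    # '%0*d' takes the field width from the argument tuple
    return '%s_%0*d%s' % (base, width, page, ext)
```

With ten pages, `page_filename('file', 9, 10)` and `page_filename('file', 10, 10)` both get two digits, so a plain alphabetical sort keeps the pages in order.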
Now, to convert the PDF into an editable file, I would probably use pdf2svg.
pdf2svg input.pdf output.svg
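If we look at the pdf2svg.c file, we can see that the code is, in principle, not complicated. Below is a minimal working example in Python (the input file name is in the inputname variable and the output file name in the outputname variable). It requires the pycairo and pypoppler libraries: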
import os

import cairo
import poppler


def convert(inputname, outputname):
    # Convert the input file name to a URI to please poppler
    uri = 'file://' + os.path.abspath(inputname)
    pdffile = poppler.document_new_from_file(uri, None)

    # We only have one page, since we split prior to converting. Get the page
    page = pdffile.get_page(0)

    # Get the page dimensions
    width, height = page.get_size()

    # Open the SVG file to write on
    surface = cairo.SVGSurface(outputname, width, height)
    context = cairo.Context(surface)

    # Now we finally can render the PDF to SVG
    page.render_for_printing(context)
    context.show_page()

    # Flush and close the SVG file
    surface.finish()
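At this point, you should have an SVG in which all text has been converted to paths, and you should be able to edit it in Inkscape without rendering issues.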
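One rough way to check that the text really was converted to paths is to look for leftover text elements in the SVG. The sketch below uses only the standard library; the function name `count_text_elements` and the choice of tags to look for are assumptions made for illustration, not something pdf2svg or poppler provides:

```python
import xml.etree.ElementTree as ET

SVG_NS = '{http://www.w3.org/2000/svg}'
TEXT_TAGS = {'text', 'tspan', 'textPath'}


def count_text_elements(svg_source):
    """Count text-bearing SVG elements in an SVG document given as a string."""
    root = ET.fromstring(svg_source)
    count = 0
    for element in root.iter():
        tag = element.tag
        if not isinstance(tag, str):
            continue
        # Strip the SVG namespace, if present, to get the bare tag name
        if tag.startswith(SVG_NS):
            tag = tag[len(SVG_NS):]
        if tag in TEXT_TAGS:
            count += 1
    return count
```

A result of 0 suggests that no selectable text is left in the file; anything higher means some text still depends on a font.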
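To split and convert in one go, you can call pdf2svg in a for loop, but you need to know the number of pages beforehand. The code below determines the number of pages and does the conversion in a single step. It requires only pycairo and pypoppler: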
import os

import cairo
import poppler


def convert(inputname, base=None):
    '''Converts a multi-page PDF to multiple SVG files.

    :param inputname: Name of the PDF to be converted
    :param base: Base name for the SVG files (optional)
    '''
    if base is None:
        base, ext = os.path.splitext(os.path.basename(inputname))

    # Convert the input file name to a URI to please poppler
    uri = 'file://' + os.path.abspath(inputname)
    pdffile = poppler.document_new_from_file(uri, None)

    pages = pdffile.get_n_pages()

    # Prefix the output template with zeros so that ordering is preserved
    # (page 10 after page 09). The width is the digit count of the page
    # count; ceil(log10(n)) is one digit short when n is 10, 100, ...
    output_template = ''.join([
        base,
        '_',
        '%0',
        str(len(str(pages))),
        'd',
        '.svg'
    ])

    # Iterate over all pages
    for nthpage in range(pages):
        page = pdffile.get_page(nthpage)

        # Output file name based on template
        outputname = output_template % (nthpage + 1)

        # Get the page dimensions
        width, height = page.get_size()

        # Open the SVG file to write on
        surface = cairo.SVGSurface(outputname, width, height)
        context = cairo.Context(surface)

        # Now we finally can render the PDF to SVG
        page.render_for_printing(context)
        context.show_page()

        # Free some memory
        surface.finish()
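For comparison, the pdf2svg for-loop alternative mentioned above could look roughly like the sketch below. It assumes that pdf2svg accepts an optional 1-based page number as its third argument, as its usage message describes, and that the page count is already known; the function names `pdf2svg_commands` and `run_all` are made up for illustration:

```python
import subprocess


def pdf2svg_commands(inputname, base, pages):
    """Build one pdf2svg command line per page (pages are 1-based)."""
    width = len(str(pages))
    commands = []
    for page in range(1, pages + 1):
        # Zero-pad the page number so the SVG files sort in page order
        outputname = '%s_%0*d.svg' % (base, width, page)
        commands.append(['pdf2svg', inputname, outputname, str(page)])
    return commands


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.check_call(command)
```

Building the command lists separately from running them makes it possible to inspect what would be executed before touching any files, for example with `run_all(pdf2svg_commands('file.pdf', 'file', 12))` once the page count is known.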
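To reassemble the pages, you can use Inkscape and stapler together and convert the files manually. But it is not hard to write code that does this. The code below uses rsvg and cairo to convert from SVG and merge everything into a single PDF: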
import rsvg
import cairo


def convert_merge(inputfiles, outputname):
    # We have to create a PDF surface and inform a size. The size is
    # irrelevant, though, as we will define the sizes of each page
    # individually.
    outputsurface = cairo.PDFSurface(outputname, 1, 1)
    outputcontext = cairo.Context(outputsurface)

    for inputfile in inputfiles:
        # Open the SVG
        svg = rsvg.Handle(file=inputfile)

        # Set the size of the page itself
        outputsurface.set_size(svg.props.width, svg.props.height)

        # Draw on the PDF
        svg.render_cairo(outputcontext)

        # Finish the page and start a new one
        outputcontext.show_page()

    # Free some memory
    outputsurface.finish()
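The merge only produces the right page order if the SVG files are passed in page order. As a small sketch using only the standard library (the helper names `page_number` and `collect_pages` are hypothetical, and the `<base>_<n>.svg` naming is assumed from the conversion step above), the files can be collected and sorted by their page number:

```python
import glob
import os
import re


def page_number(path):
    """Return the last run of digits in the file name, or -1 if there is none."""
    numbers = re.findall(r'\d+', os.path.basename(path))
    return int(numbers[-1]) if numbers else -1


def collect_pages(directory, base):
    """List '<base>_<n>.svg' files in `directory`, ordered by page number."""
    pattern = os.path.join(directory, base + '_*.svg')
    return sorted(glob.glob(pattern), key=page_number)
```

Sorting numerically rather than alphabetically keeps the order right even when the names are not zero-padded, so something like `convert_merge(collect_pages('.', 'file'), 'merged.pdf')` would see page 2 before page 10.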
PS: It should be possible to use the pdftocairo command instead, but it does not seem to call render_for_printing(), so the output SVG keeps the font information.