Convert PDF text to outlines?

Ada*_*ith 3 pdf inkscape

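Does anybody know a way to vectorize the text in a PDF document? That is, I want each letter to be a shape/outline, without any textual content. I'm on Linux, so an open-source or at least non-Windows solution is preferred.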

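Context: I am trying to edit some old PDFs for which I no longer have the fonts. I wanted to do this in Inkscape, but it replaces all the fonts with generic ones, and the result is barely readable. I have also tried converting back and forth with pdf2ps and ps2pdf, but the font information survives the round trip, so the file still looks terrible when I load it into Inkscape.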

Any ideas? Thanks.

ren*_*toc 6

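To achieve this, you will have to:

  1. Split the PDF into individual pages;
  2. Convert the PDF pages to SVG;
  3. Edit the pages you want;
  4. Reassemble the pages.

This answer omits step 3, because it cannot be automated.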

Splitting the PDF

If you do not want to split the document programmatically, the modern way is to use stapler. In your favorite shell:

stapler burst file.pdf

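This generates {file_1.pdf,...,file_N.pdf}, where 1...N are the PDF pages. Stapler itself uses PyPDF2, and the code for splitting a PDF file is not that complex. The following function splits a file and saves the individual pages in the current directory. (Shamelessly copied from stapler's commands.py file.)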

import os
# Older PyPDF2 class names; newer releases renamed them to PdfReader/PdfWriter
from PyPDF2 import PdfFileWriter, PdfFileReader

def split(filename):
    # PDFs are binary files, so open the input in binary mode
    with open(filename, 'rb') as inputfp:
        inputpdf = PdfFileReader(inputfp)

        base, ext = os.path.splitext(os.path.basename(filename))

        # Prefix the output template with zeros so that ordering is preserved
        # (page 10 after page 09). The width is the number of digits in the
        # page count, so a 10-page file gets '%02d'.
        output_template = ''.join([
            base,
            '_',
            '%0',
            str(len(str(inputpdf.getNumPages()))),
            'd',
            ext
        ])

        for page in range(inputpdf.getNumPages()):
            outputpdf = PdfFileWriter()
            outputpdf.addPage(inputpdf.getPage(page))

            outputname = output_template % (page + 1)

            with open(outputname, 'wb') as fp:
                outputpdf.write(fp)
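
The zero-padded naming scheme is the only subtle part of the split step, so here is a minimal standalone sketch of it. The helper name page_names is hypothetical (it is not part of stapler or PyPDF2), and it assumes pages are numbered from 1.

```python
def page_names(base, ext, page_count):
    """Return zero-padded file names for pages 1..page_count.

    Hypothetical helper for illustration; not part of stapler or PyPDF2.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")

    # Pad to the number of digits in the largest page number, so that
    # a plain alphabetical sort matches the page order.
    width = len(str(page_count))

    return [
        "{}_{}{}".format(base, str(number).zfill(width), ext)
        for number in range(1, page_count + 1)
    ]
```

Padding to the digit count of the last page keeps a 10-page document at two digits, so file_02.pdf sorts before file_10.pdf in a shell glob.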

Converting the individual pages to SVG

Now, to convert the PDF to an editable file, I would probably use pdf2svg.

pdf2svg input.pdf output.svg

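If we look at the pdf2svg.c file, we can see that the code is, in principle, not that complicated (assuming the input file name is in the filename variable and the output file name is in the outputname variable). A minimal working example in Python follows. It requires the pycairo and pypoppler libraries: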

import os

import cairo
import poppler

def convert(inputname, outputname):
    # Convert the input file name to a URI to please poppler
    uri = 'file://' + os.path.abspath(inputname)

    pdffile = poppler.document_new_from_file(uri, None)

    # We only have one page, since we split prior to converting. Get the page
    page = pdffile.get_page(0)

    # Get the page dimensions
    width, height = page.get_size()

    # Open the SVG file to write on
    surface = cairo.SVGSurface(outputname, width, height)
    context = cairo.Context(surface)

    # Now we finally can render the PDF to SVG
    page.render_for_printing(context)
    context.show_page()

    # Flush the SVG to disk and free some memory
    surface.finish()

At this point you should have an SVG in which all text has been converted to paths, and which you can edit in Inkscape without rendering problems.
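
If you want to double-check that no live text survived the conversion, one rough test is to look for text elements in the generated SVG. The sketch below uses only the standard library; the function name count_text_elements is hypothetical, and it assumes that any remaining text would show up as SVG text or tspan elements.

```python
import xml.etree.ElementTree as ET

def count_text_elements(svg_source):
    """Count <text> and <tspan> elements in an SVG document.

    svg_source is the SVG content as a string or bytes object.
    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(svg_source)

    count = 0
    for element in root.iter():
        # Namespaced tags look like '{http://www.w3.org/2000/svg}text',
        # so keep only the part after the closing brace.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in ("text", "tspan"):
            count += 1

    return count
```

A result of 0 for a converted page suggests that every glyph was written as a path. Read the file in binary mode (open(name, 'rb').read()) before passing it in, so that the XML declaration's encoding is honored.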

Combining steps 1 and 2

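You can call pdf2svg in a for loop to do this, but you would need to know the number of pages beforehand. The code below finds the number of pages and does the conversion in a single step. It requires only pycairo and pypoppler: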

import os

import cairo
import poppler

def convert(inputname, base=None):
    '''Converts a multi-page PDF to multiple SVG files.

    :param inputname: Name of the PDF to be converted
    :param base: Base name for the SVG files (optional)
    '''
    if base is None:
        base, ext = os.path.splitext(os.path.basename(inputname))

    # Convert the input file name to a URI to please poppler
    uri = 'file://' + os.path.abspath(inputname)

    pdffile = poppler.document_new_from_file(uri, None)

    pages = pdffile.get_n_pages()

    # Prefix the output template with zeros so that ordering is preserved
    # (page 10 after page 09). The width is the number of digits in the
    # page count, so a 10-page file gets '%02d'.
    output_template = ''.join([
        base,
        '_',
        '%0',
        str(len(str(pages))),
        'd',
        '.svg'
    ])

    # Iterate over all pages
    for nthpage in range(pages):
        page = pdffile.get_page(nthpage)

        # Output file name based on template
        outputname = output_template % (nthpage + 1)

        # Get the page dimensions
        width, height = page.get_size()

        # Open the SVG file to write on
        surface = cairo.SVGSurface(outputname, width, height)
        context = cairo.Context(surface)

        # Now we finally can render the PDF to SVG
        page.render_for_printing(context)
        context.show_page()

        # Free some memory
        surface.finish()
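
For comparison, the pdf2svg loop mentioned earlier could be scripted roughly as follows. This is only a sketch: it assumes the page count is already known, and that your pdf2svg build accepts a page number as an optional third argument. The names pdf2svg_commands and run_all are hypothetical.

```python
import subprocess

def pdf2svg_commands(inputname, base, page_count):
    """Build one pdf2svg command line per page (pages are numbered from 1).

    Assumes pdf2svg accepts a page number as its third argument.
    """
    width = len(str(page_count))
    commands = []
    for page in range(1, page_count + 1):
        # Zero-pad the page number so the output files sort in page order
        outputname = "{}_{}.svg".format(base, str(page).zfill(width))
        commands.append(["pdf2svg", inputname, outputname, str(page)])
    return commands

def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the argument lists separately from running them makes it easy to print and inspect the commands before anything is written to disk.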

Assembling the SVGs into a single PDF

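To reassemble the pages, you can use the inkscape/stapler pair to convert the files manually. But it is not hard to write code that does this. The code below uses rsvg and cairo to convert from SVG and merge everything into a single PDF: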

import rsvg
import cairo

def convert_merge(inputfiles, outputname):
    # We have to create a PDF surface and give it a size. The size is
    # irrelevant, though, as we will set the size of each page
    # individually.
    outputsurface = cairo.PDFSurface(outputname, 1, 1)
    outputcontext = cairo.Context(outputsurface)

    for inputfile in inputfiles:
        # Open the SVG
        svg = rsvg.Handle(file=inputfile)

        # Set the size of the page itself
        outputsurface.set_size(svg.props.width, svg.props.height)

        # Draw on the PDF
        svg.render_cairo(outputcontext)

        # Finish the page and start a new one
        outputcontext.show_page()

    # Free some memory
    outputsurface.finish()
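
Note that convert_merge writes pages in the order in which inputfiles is given, so the list has to be in page order. A minimal sketch of a helper that sorts by the numeric page suffix, assuming the base_N.svg naming used above (the name sort_by_page_number is hypothetical):

```python
import re

def sort_by_page_number(filenames):
    """Sort names like 'file_2.svg' and 'file_10.svg' by their page number."""
    def page_number(name):
        # Expect the page number right before the '.svg' extension
        match = re.search(r"_(\d+)\.svg$", name)
        if match is None:
            raise ValueError("no page number found in %r" % name)
        return int(match.group(1))

    return sorted(filenames, key=page_number)
```

It could then be combined with the merge step as convert_merge(sort_by_page_number(glob.glob('file_*.svg')), 'merged.pdf'). Sorting numerically also works if the files were written without zero padding.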

PS: It should be possible to use the pdftocairo command instead, but it does not seem to call render_for_printing(), so the output SVG keeps the font information.