Converting a PDF with columns to text

Question

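On Unix or Windows, I would like to convert this dictionary into a Python dictionary. I copied the contents of the PDF dictionary into an .rtf file, intending to read and process them with Python. However, the result looks something like this: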

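A /e?/ noun a human blood type of the ABO system, containing the A antigen (NOTE: Someone with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.)
AA
abdominal distension /bd?m?n(?)l d?s ten?(?)n/ noun a condition in which the abdomen
is stretched because of gas or fluid
A
abdominal distension
AA abbr Alcoholics Anonymous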

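It basically squashes the columns of the PDF into a strange jumble. How can I convert the PDF to text so that the columns are respected? In other words, the desired output is:

A /e?/ noun a human blood type of the ABO system, containing the A antigen (NOTE: Someone with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.)
AA abbr Alcoholics Anonymous

...and so on.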

Answer 1

Kur*_*fle 5

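You basically have two options for getting at the text:

Extract the text directly from each page as it is.
Split each page into two halves along the gap between the columns, and extract the text from each half separately.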

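For the first option, I suggest you try pdftotext first, but with the -layout parameter. (There are other tools you can try if pdftotext is not good enough, such as TET, the Text Extraction Toolkit from the PDFlib folks.)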
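Since the question aims to process the text in Python, the first option could be driven from a script. The following is only a minimal sketch, assuming that pdftotext (from Poppler or Xpdf) is installed and on the PATH; the function names `build_pdftotext_cmd` and `extract_layout_text` are made up for this illustration.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_cmd(pdf_path: str,
                        first_page: Optional[int] = None,
                        last_page: Optional[int] = None) -> List[str]:
    """Build a pdftotext command line that keeps the physical page layout."""
    cmd = ["pdftotext", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    # "-" as the output file name makes pdftotext write to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_layout_text(pdf_path: str,
                        first_page: Optional[int] = None,
                        last_page: Optional[int] = None) -> str:
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumption: pdftotext emits UTF-8 (its usual default).
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_layout_text("dictionary.pdf", 8, 8)` (with a hypothetical file name) would then return the text of page 8 only.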

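To go down the road of the second option with Ghostscript and other tools, you may want to look at my answers to the following questions:

Linux-based tool to chop PDFs into multiple pages (Super User)
Convert PDF 2 sides per page to 1 side per page (Super User)
How can I split a PDF's pages down the middle? (Super User)
Cropping a PDF using Ghostscript 9.01 (Stack Overflow)
Split one PDF page into two (Stack Overflow)
PDF - Remove White Margins (Stack Overflow)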
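As a lighter-weight variant of the second option, and not the Ghostscript route the linked answers describe, pdftotext itself can restrict extraction to a crop area with its -x, -y, -W and -H flags. The sketch below only builds the two command lines for the left and right half of a page. It assumes the default resolution of 72 dpi, so that crop coordinates coincide with PDF points, and that the column gap sits at the horizontal middle of the page. The helper name and the page size in the usage note are hypothetical.

```python
from typing import List


def build_half_page_cmds(pdf_path: str, page: int,
                         page_width: int, page_height: int) -> List[List[str]]:
    """Return two pdftotext commands: left half first, then right half."""
    half = page_width // 2  # assumed position of the column gap
    cmds = []
    for x_offset, width in ((0, half), (half, page_width - half)):
        cmds.append([
            "pdftotext",
            "-f", str(page), "-l", str(page),    # restrict to a single page
            "-x", str(x_offset), "-y", "0",      # top-left corner of crop area
            "-W", str(width), "-H", str(page_height),  # crop area size
            pdf_path, "-",                        # "-" writes to stdout
        ])
    return cmds
```

For example, `build_half_page_cmds("dictionary.pdf", 8, 612, 792)` yields one command per column of page 8 for a page measuring 612 x 792 points; the real page size can be read from pdfinfo. Each command can be passed to `subprocess.run`, and the two outputs concatenated in reading order.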

`pdftotext -layout`

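You can try it with the command-line tool pdftotext. You will have to decide for yourself whether it is "good enough" for your purpose.

The following command extracts the text from page 8 only (the first page with a two-column layout) and prints it to <stdout>: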

$ pdftotext -f 8 -l 8 -layout                                         \
           Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf - \
 | head -n 30

The result is:

Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM

                                                          A
 A /e?/ noun a human blood type of the ABO                abdominal distension /bd?m?n(?)l d?s
 A                                                        abdominal distension
 system, containing the A antigen (NOTE: Some-              ten?(?)n/ noun a condition in which the abdo-
 one with type A can donate to people of the              men is stretched because of gas or fluid
 same group or of the AB group, and can receive           abdominal pain /b d?m?n(?)l pe?n/ noun
                                                          abdominal pain
 blood from people with type A or type O.)                pain in the abdomen caused by indigestion or
 AA
 AA abbr Alcoholics Anonymous                             more serious disorders
 A & E /e? ?nd  i
                     /, A & E department /e? ?nd           abdominal viscera /bd?m?n(?)l    v?s?r?/
 A & E                                                    abdominal viscera
    i
      d? p?
           tm?nt/ noun same as accident and
                                                          plural noun the organs which are contained in
 emergency department                                     the abdomen, e.g. the stomach, liver and intes-
 A & E medicine /e? ?nd     i
                              med(?)s?n/
 A & E medicine
                                                          tines
                                                          abdominal wall /b d?m?n(?)l w?
                                                                                        l/ noun
                                                          abdominal wall
 noun the medical procedures used in A & E de-                                                            
 partments                                                muscular tissue which surrounds the abdomen
                                                          abdomino- /bd?m?n??/ prefix referring to
                                                          abdomino-
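In this layout-preserving output both columns still share each line, so one more step is needed to read the left column in full before the right one. A rough sketch of that step follows, assuming the gap between the columns sits at a fixed character position (about 57 in the sample above, but it has to be checked per document). `split_columns` is a name chosen for this illustration.

```python
def split_columns(layout_text: str, boundary: int) -> str:
    """Reorder two-column layout text: all left-column lines, then all right.

    `boundary` is the character index where the right column starts.
    Lines where a column is empty contribute nothing for that column.
    """
    left, right = [], []
    for line in layout_text.splitlines():
        left_part = line[:boundary].strip()
        right_part = line[boundary:].strip()
        if left_part:
            left.append(left_part)
        if right_part:
            right.append(right_part)
    return "\n".join(left + right)
```

A fixed boundary is the simplest possible choice. It will cut words wherever a left-column line runs past the gap, so the result should be inspected before relying on it.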

Note the use of -layout! Without it, the extracted text would look like this:

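Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM A A /e?/ noun a human blood type of the ABO system, containing the A antigen (NOTE: SomeA

one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.) AA abbr Alcoholics Anonymous A & E /e? ?nd i /, A & E department /e? ?nd i d? p? tm?nt/ noun same as accident and emergency department A & E medicine /e? ?nd i med(?)s?n/ noun the medical procedures used in A & E deAA

A & E A & E medicine partments AB /e? bi / noun a human blood type of the ABO system, containing the A and B antigens AB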

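I noticed that on page 8 the file uses, but does not embed, the fonts Courier, Helvetica, Helvetica-Bold, Times-Roman and Times-Italic.

This poses no problem for text extraction, because all of these fonts use /WinAnsiEncoding.

However, there are other fonts that are embedded as subsets. These fonts do use a /Custom encoding, but they do not provide a /ToUnicode table. This table is required for reliable text extraction (translating the glyph names back to character names).

What I said can be seen in this table: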

$ pdffonts -f 8 -l 8 Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf 
 name                           type        encoding      emb sub uni object ID
 ------------------------------ ----------- ------------- --- --- --- ---------
 Helvetica-Bold                 Type 1      WinAnsi       no  no  no    1505  0
 Courier                        Type 1      WinAnsi       no  no  no    1507  0
 Helvetica                      Type 1      WinAnsi       no  no  no    1497  0
 MOEKLA+Times-PhoneticIPA       Type 1C     Custom        yes yes yes   1509  0
 Times-Roman                    Type 1      WinAnsi       no  no  no    1506  0
 Times-Italic                   Type 1      WinAnsi       no  no  no    1499  0
 IGFBAL+EuropeanPi-Three        Type 1C     Custom        yes yes no    1502  0
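To check a whole document for this kind of problem, the pdffonts table can be scanned by a script. The sketch below parses the columns from the right, because the type column may contain spaces, and lists the fonts that combine a Custom encoding with "no" in the uni column. It assumes font names contain no whitespace. Both function names are invented for this example.

```python
from typing import Dict, List


def parse_pdffonts(output: str) -> List[Dict[str, str]]:
    """Parse the table printed by pdffonts into one dict per font."""
    fonts = []
    for line in output.splitlines():
        tokens = line.split()
        # Skip blank lines, the header row and the dashed rule.
        if len(tokens) < 8 or tokens[0] == "name" or set(tokens[0]) == {"-"}:
            continue
        fonts.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-6]),  # e.g. "Type 1", "Type 1C"
            "encoding": tokens[-6],
            "emb": tokens[-5],
            "sub": tokens[-4],
            "uni": tokens[-3],
            # tokens[-2:] hold the object ID (number and generation).
        })
    return fonts


def fonts_without_tounicode(fonts: List[Dict[str, str]]) -> List[str]:
    """Names of fonts with a Custom encoding and no /ToUnicode table."""
    return [f["name"] for f in fonts
            if f["encoding"] == "Custom" and f["uni"] == "no"]
```

Text set in any font this returns is likely to come out as placeholder characters or wrong letters, whichever extraction tool is used.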

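As it happens, I recently hand-coded 5 different PDF files, with commented source code, for a new GitHub project. These 5 files demonstrate how important a correct /ToUnicode table is for every font that is embedded as a subset. They can be found here, along with a README that explains more of the details: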

https://github.com/angea/PDF101/tree/master/handcoded/textextract
