Is it possible to get the bounding box of each word with Python?

Question

I know that

pdftotext -bbox foobar.pdf

creates an HTML file that contains lines like these:

<word xMin="301.703800" yMin="104.483700" xMax="309.697000" yMax="115.283700">is</word>
<word xMin="313.046200" yMin="104.483700" xMax="318.374200" yMax="115.283700">a</word>
<word xMin="321.603400" yMin="104.483700" xMax="365.509000" yMax="115.283700">universal</word>
<word xMin="368.858200" yMin="104.483700" xMax="384.821800" yMax="115.283700">file</word>
<word xMin="388.291000" yMin="104.483700" xMax="420.229000" yMax="115.283700">format</word>

So every word has its own bounding box.
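For reference, output in this shape can be read back into Python with a few lines. This is only a minimal sketch: the `Word` tuple and `parse_bbox_words` helper are hypothetical names, and the regular expression assumes the attribute order shown in the sample above (`xMin`, `yMin`, `xMax`, `yMax`).

```python
import html
import re
from typing import List, NamedTuple


class Word(NamedTuple):
    """One word and its bounding box (hypothetical helper type)."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Assumes the attribute order seen in the sample output above.
WORD_RE = re.compile(
    r'<word xMin="([^"]+)" yMin="([^"]+)" xMax="([^"]+)" yMax="([^"]+)">(.*?)</word>'
)


def parse_bbox_words(markup: str) -> List[Word]:
    """Extract every <word> element from pdftotext -bbox style markup."""
    words = []
    for x_min, y_min, x_max, y_max, text in WORD_RE.findall(markup):
        # Undo HTML escaping such as &amp; in the word text.
        words.append(
            Word(html.unescape(text), float(x_min), float(y_min),
                 float(x_max), float(y_max))
        )
    return words


sample = (
    '<word xMin="301.703800" yMin="104.483700" '
    'xMax="309.697000" yMax="115.283700">is</word>\n'
    '<word xMin="313.046200" yMin="104.483700" '
    'xMax="318.374200" yMax="115.283700">a</word>\n'
)

for word in parse_bbox_words(sample):
    print(word)
```

This still depends on the external `pdftotext` tool having produced the file, so it does not answer the pure-Python part of the question.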

By contrast, the Python package PDFminer seems to give only the positions of text blocks (see example).

How can I get the bounding box of each word in Python?

Answer 1

Jor*_*ens 1

Disclaimer: I am the author of borb, the package used in this answer.

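You will need to do some processing to get word-level bounding boxes. The problem is that a PDF (in the worst case) contains only rendering instructions, not structural information.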

In short, your PDF may contain (in pseudocode):

go to location 90, 700
set the active font to Helvetica, size 12
set the active color to black
render "Hello World" in the active font

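The problem is that the render instruction (the last one above) may contain anything from:

a single letter
multiple letters
a single word
up to multiple words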

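To retrieve the bounding boxes of words, you need to do some processing (as mentioned earlier). You have to render these instructions and split the text into words, preferably while rendering.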

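After that, it is a matter of keeping track of the turtle's coordinates (the current drawing position), and you are good to go.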
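The idea can be illustrated with a toy sketch. This is not how borb is implemented; it only shows the principle for the case where one render instruction holds several words. The names `WordBox`, `char_width`, and `split_render_instruction` are hypothetical, and the glyph metric (every character is half the font size wide, every box is one font size tall) is a loud simplification; a real implementation would read widths from the font and account for the baseline.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class WordBox:
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def char_width(ch: str, font_size: float) -> float:
    # Hypothetical metric: every glyph is half the font size wide.
    # A real renderer would look this up in the font's width table.
    return 0.5 * font_size


def split_render_instruction(
    text: str, x: float, y: float, font_size: float
) -> List[WordBox]:
    """Split one 'render text' instruction into per-word boxes."""
    boxes = []
    cursor = x  # the "turtle": current horizontal drawing position
    # Alternate between runs of non-whitespace (words) and whitespace.
    for token in re.findall(r"\S+|\s+", text):
        width = sum(char_width(ch, font_size) for ch in token)
        if not token.isspace():
            # Simplification: the box spans one font size above y.
            boxes.append(WordBox(token, cursor, y, cursor + width, y + font_size))
        cursor += width  # advance the turtle past words and spaces alike
    return boxes


# The pseudocode above: go to 90, 700, size 12, render "Hello World".
for box in split_render_instruction("Hello World", 90, 700, 12):
    print(box)
```

A word that is spread over several render instructions (for example, one letter per instruction) would additionally need neighbouring fragments to be merged, which this sketch does not attempt.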

borb does this for you (under the hood).

from borb.pdf import PDF
from borb.toolkit import RegularExpressionTextExtraction

# this line builds a RegularExpressionTextExtraction
# this class listens to rendering instructions 
# and performs the logic I mentioned in the text part of this answer
l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("[^ ]+")

# now we can load the file and perform our processing
with open("input.pdf", "rb") as fh:
    PDF.loads(fh, [l])

# now we just need to get the boxes out of it
# RegularExpressionTextExtraction returns a list of type PDFMatch
# this class can return a list of bounding boxes (should your
# regular expression ever need to be matched over separate lines of text)
for m in l.get_matches_for_page(0):
    # here we just print the Rectangle
    # but feel free to do something useful with it
    print(m.get_bounding_boxes()[0])

borb is an open-source, pure-Python PDF library for creating, modifying, and reading PDF documents. You can install it with:

pip install borb

Alternatively, you can build it from source by forking or downloading the GitHub repository.

Asked:	8 years, 4 months ago
Viewed:	1720 times
Last active:	2 years, 11 months ago