所以我试图从pdf文件中提取文本,我需要它的位置,宽度,高度,字体.
我尝试了很多,但最有用和最完整的解决方案看起来是PDFMiner,在这种情况下,更确切地说是pdf2txt.py.
我已经按照文档和示例进行操作,并尝试Learn More使用以下命令从我的pdf中提取文本:
pdf2txt.py -Y normal -t xml -o buttons.xml buttons.pdf
Run Code Online (Sandbox Code Playgroud)
输出buttons.xml看起来像这样:
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,799.900,449.944" rotate="0">
<textbox id="0" bbox="164.979,213.240,247.680,235.944">
<textline bbox="164.979,213.240,247.680,235.944">
<text font="KZNUUP+HelveticaNeue-Bold" bbox="164.979,213.240,178.978,235.944" size="22.704">(cid:51)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="173.280,213.240,187.278,235.944" size="22.704">(cid:76)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="181.315,213.240,195.313,235.944" size="22.704">(cid:72)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="189.350,213.240,203.348,235.944" size="22.704">(cid:89)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="194.795,213.240,208.793,235.944" size="22.704">(cid:85)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="203.096,213.240,217.094,235.944" size="22.704">(cid:3)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="206.987,213.240,220.986,235.944" size="22.704">(cid:52)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="219.684,213.240,233.682,235.944" size="22.704">(cid:86)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="228.237,213.240,242.235,235.944" size="22.704">(cid:89)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="233.682,213.240,247.680,235.944" size="22.704">(cid:76)</text>
<text></text>
</textline>
</textbox>
<textgroup bbox="164.979,213.240,419.659,235.944"> …Run Code Online (Sandbox Code Playgroud)