dha*_*0us 1 pdf text-extraction go
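I am trying to extract text from a PDF file in Go. See the code below. For some reason it prints complete garbage (what looks like random numbers). Here is the PDF. I believe the text can be extracted, because I can copy and paste text from the file.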
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"strings"

	pdf "github.com/unidoc/unipdf/v3/model"
)

func main() {
	fmt.Println("Enter URL of PDF file:")
	reader := bufio.NewReader(os.Stdin)
	url, err := reader.ReadString('\n')
	if err != nil {
		log.Fatal(err)
	}
	url = strings.TrimSpace(url)

	// Fetch PDF from URL.
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	buf, _ := ioutil.ReadAll(resp.Body)
	pdfReader, err := pdf.NewPdfReader(bytes.NewReader(buf))
	if err != nil {
		log.Fatal(err)
	}

	// Parse PDF file.
	isEncrypted, err := pdfReader.IsEncrypted()
	if err != nil {
		log.Fatal(err)
	}

	// If PDF is encrypted, exit with message.
	if isEncrypted {
		fmt.Println("Error: PDF is encrypted.")
		os.Exit(1)
	}

	// Get number of pages.
	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		log.Fatal(err)
	}

	// Iterate through pages and print text.
	for i := 1; i <= numPages; i++ {
		page, err := pdfReader.GetPage(i)
		if err != nil {
			log.Fatal(err)
		}
		text, err := page.GetAllContentStreams()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(text)
	}
}
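I could not find a free, capable Go package for extracting text from PDFs. Fortunately, there are free CLI tools that can do it.
pdftotext from Xpdf is a promising option. Look at its output: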
$ pdftotext -layout -nopgbrk 2023-04-24_BU-12.pdf - | head
ALL INDIA TENNIS ASSOCIATION
As on 24TH April , 2023
BOY'S UNDER-12 2011 BEST BEST 25% BEST POINTS
24TH April , 2023 Eight Eight Eight CUT FOR TTL.
SING. DBLS. DBLS. NO SHOW PTS.
RANK NAME OF PLAYER REG NO. DOB STATE PTS. PTS. PTS. LATE WL Final
1 VIVAAN MIRDHA 432735 08-Apr-11 (RJ) 485 565 141.25 0 797
2 SMIT SACHIN UNDRE 437763 07-Feb-11 (MH) 435 480 120 0 664.25
3 RISHIKESH MANE 436806 15-Jan-11 (MH) 420 380 95 0 619
4 VIRAJ CHOUDHARY 436648 03-Feb-11 (DL) 415 420 105 0 598.75
On Ubuntu, the tool can be installed with the following command:
$ sudo apt install poppler-utils
It can easily be run from a Go application using the exec package (os/exec).