How to extract text from a PDF using Golang?

dha*_*0us 1 pdf text-extraction go

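I am trying to extract text from a PDF file in Golang. See the code below. For some reason, it prints complete garbage (some random numbers). The file is a PDF, and I believe its text can be extracted because I can copy and paste text from it.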

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "os"
    "strings"
    pdf "github.com/unidoc/unipdf/v3/model"
)

func main() {
    fmt.Println("Enter URL of PDF file:")
    reader := bufio.NewReader(os.Stdin)
    url, err := reader.ReadString('\n')
    if err != nil {
        log.Fatal(err)
    }
    url = strings.TrimSpace(url)

    // Fetch PDF from URL.
    resp, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    buf, _ := ioutil.ReadAll(resp.Body)
    pdfReader, err := pdf.NewPdfReader(bytes.NewReader(buf))
    if err != nil {
        log.Fatal(err)
    }

    // Parse PDF file.
    isEncrypted, err := pdfReader.IsEncrypted()
    if err != nil {
        log.Fatal(err)
    }

    // If PDF is encrypted, exit with message.
    if isEncrypted {
        fmt.Println("Error: PDF is encrypted.")
        os.Exit(1)
    }

    // Get number of pages.
    numPages, err := pdfReader.GetNumPages()
    if err != nil {
        log.Fatal(err)
    }
    // Iterate through pages and print text.
    for i := 1; i <= numPages; i++ {
        page, err := pdfReader.GetPage(i)
        if err != nil {
            log.Fatal(err)
        }
        text, err := page.GetAllContentStreams()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(text)
    }
}

Zek*_* Lu 6

I could not find a free, capable Go package for extracting text from PDFs. Fortunately, there are free CLI tools that can do it.

pdftotext, from Xpdf, is a promising option. Here is its output:

$ pdftotext -layout -nopgbrk 2023-04-24_BU-12.pdf - | head
                           ALL INDIA TENNIS ASSOCIATION
                                        As on 24TH April , 2023
       BOY'S UNDER-12                                 2011                BEST    BEST    25% BEST POINTS
       24TH April , 2023                                                  Eight   Eight     Eight  CUT FOR     TTL.
                                                                          SING.   DBLS.     DBLS. NO SHOW      PTS.
RANK   NAME OF PLAYER                     REG NO.      DOB       STATE     PTS.   PTS.       PTS.  LATE WL    Final
  1    VIVAAN MIRDHA                      432735    08-Apr-11      (RJ)    485     565     141.25     0        797
  2    SMIT SACHIN UNDRE                  437763    07-Feb-11    (MH)      435     480       120      0      664.25
  3    RISHIKESH MANE                     436806    15-Jan-11    (MH)      420     380        95      0        619
  4    VIRAJ CHOUDHARY                    436648    03-Feb-11      (DL)    415     420       105      0      598.75

On Ubuntu, the tool can be installed with the following command (the poppler-utils package provides pdftotext):

$ sudo apt install poppler-utils

It is easy to run it from a Go application using the os/exec package.
