Find which page of a PDF document contains a search string, using Python

Question


use*_*144 5 python pdf pypdf

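Which Python packages can I use to find out which page a particular "search string" is on?

I have looked at several Python PDF packages but could not work out which one to use. PyPDF does not seem to have this feature, and PDFMiner seems like overkill for such a simple task. Any suggestions?

To be more precise: I have several PDF documents, and I want to extract the pages between the string "Begin" and the string "End".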

Answer 1

use*_*144 15

I eventually found that pyPdf can do this. I am posting the code in case it helps someone else.

(1) A function to locate the string

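def fnPDF_FindText(xFile, xString):
    # xFile : the PDF file in which to look
    # xString : the string to look for (used as a regular expression)
    # Returns the 0-based index of the first matching page, or -1 if none.
    # Note: pyPdf is the original Python 2 package.
    import pyPdf, re
    PageFound = -1
    with open(xFile, "rb") as pdfFile:
        pdfDoc = pyPdf.PdfFileReader(pdfFile)
        for i in range(0, pdfDoc.getNumPages()):
            content = pdfDoc.getPage(i).extractText() + "\n"
            content1 = content.encode('ascii', 'ignore')
            # Case-insensitive match, so "Begin" also matches "begin" or "BEGIN".
            ResSearch = re.search(xString, content1, re.IGNORECASE)
            if ResSearch is not None:
                PageFound = i
                break
    return PageFound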
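
pyPdf is no longer maintained and targets Python 2. As a rough sketch, the same search could look like this on Python 3 with the maintained `pypdf` package, assuming it is installed. The names `load_page_texts` and `find_first_page` are illustrative and not part of the original answer; the search is kept separate from the PDF reading so it works on any list of page texts.

```python
import re

def load_page_texts(pdf_path):
    """Return the extracted text of every page (assumes the pypdf package)."""
    from pypdf import PdfReader  # imported lazily; third-party dependency
    reader = PdfReader(pdf_path)
    # Fall back to "" in case a page yields no text (e.g. image-only pages)
    return [page.extract_text() or "" for page in reader.pages]

def find_first_page(page_texts, needle):
    """Return the 0-based index of the first page containing needle, or -1."""
    # re.escape treats the needle as a literal string rather than a regex
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    for index, text in enumerate(page_texts):
        if pattern.search(text):
            return index
    return -1

# Stand-in page texts; with a real file, use load_page_texts(...) instead
sample_pages = ["Title page", "Chapter 1\nBEGIN of section", "The End"]
print(find_first_page(sample_pages, "begin"))    # 1
print(find_first_page(sample_pages, "missing"))  # -1
```

Unlike the function above, this version matches the search string literally; drop `re.escape` if regular-expression patterns are wanted.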


(2) A function to extract the pages of interest

def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
    # Copies pages xPageStart .. xPageEnd - 1 (0-based) into a new PDF file.
    from pyPdf import PdfFileReader, PdfFileWriter
    output = PdfFileWriter()
    with open(xFileNameOriginal, "rb") as inputStream:
        pdfOne = PdfFileReader(inputStream)
        for i in range(xPageStart, xPageEnd):
            output.addPage(pdfOne.getPage(i))
        # Write once, after all pages are added, while the source file is still open.
        with open(xFileNameOutput, "wb") as outputStream:
            output.write(outputStream)
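
For the "Begin"/"End" case in the question, the page range can be worked out from the per-page text before anything is written. The sketch below assumes Python 3 and the `pypdf` package for the writing step; `page_range_between` and `write_pages` are hypothetical helper names. Because `range` excludes its end value, the stop index is one past the page that contains the end marker.

```python
def page_range_between(page_texts, begin_marker, end_marker):
    """Return (start, stop) 0-based page indices, stop exclusive, or None.

    start is the first page containing begin_marker; stop is one past the
    first page at or after start that contains end_marker.
    """
    start = next((i for i, t in enumerate(page_texts) if begin_marker in t), None)
    if start is None:
        return None
    for i in range(start, len(page_texts)):
        if end_marker in page_texts[i]:
            return start, i + 1
    return None

def write_pages(source_path, output_path, start, stop):
    """Copy pages start..stop-1 into a new PDF (assumes the pypdf package)."""
    from pypdf import PdfReader, PdfWriter  # third-party dependency
    reader = PdfReader(source_path)
    writer = PdfWriter()
    for i in range(start, stop):
        writer.add_page(reader.pages[i])
    with open(output_path, "wb") as out:
        writer.write(out)

# Stand-in page texts for illustration
sample_pages = ["cover", "Begin here", "middle", "the End", "appendix"]
print(page_range_between(sample_pages, "Begin", "End"))  # (1, 4)
```

The markers are matched case-sensitively here, so "appendix" does not count as containing "End".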


I hope this helps others.

Answer 2

dat*_*ght 3

Find which page of a PDF document a search string is on, using Python

PyPDF2

# import packages
import PyPDF2
import re

# open the pdf file (PdfReader is the name used by PyPDF2 2.x and later)
reader = PyPDF2.PdfReader(r"source_file_path")

# get number of pages
NumPages = len(reader.pages)

# define keyterms
String = "P4F-21B"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = reader.pages[i]
    Text = PageObj.extract_text()
    ResSearch = re.search(String, Text)
    if ResSearch is not None:
        print(ResSearch)
        print("Page Number" + str(i+1))


Output:

<re.Match object; span=(57, 64), match='P4F-21B'>
Page Number1


PyMuPDF

import fitz
import re

# load document
doc = fitz.open(r"C:\Users\shraddha.shetty\Desktop\OCR-pages-deleted.pdf")

# define keyterms
String = "P4F-21B"

# get text, search for string and print count on page.
for page in doc:
    text = page.get_text()
    matches = re.findall(String, text)
    if matches:
        print(f'count on page {page.number + 1} is: {len(matches)}')
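
Both snippets in this answer pass the search term straight to `re`, so characters such as `.` or `(` in a term would be read as regex syntax. A minimal sketch of a literal per-page count that does not depend on the PDF library follows; `count_per_page` is an illustrative name, and the page texts are stand-ins for whatever `get_text()` or `extract_text()` returns.

```python
import re

def count_per_page(page_texts, term):
    """Map 1-based page number -> number of literal occurrences of term."""
    pattern = re.compile(re.escape(term))
    counts = {}
    for number, text in enumerate(page_texts, start=1):
        hits = len(pattern.findall(text))
        if hits:
            counts[number] = hits
    return counts

# Stand-in page texts for illustration
sample_pages = ["P4F-21B and P4F-21B again", "nothing here", "ref P4F-21B"]
print(count_per_page(sample_pages, "P4F-21B"))  # {1: 2, 3: 1}
```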


Archived:	13 years, 3 months ago
Viewed:	6605 times
Last active:	6 years, 6 months ago