How do I remove headers and footers when extracting a multi-page PDF to text with PDFMiner?

Pet*_*ter 5 python text-extraction header footer pdfminer

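I have successfully extracted the text of a multi-page PDF with pdfminer.six in Python and converted it into a single string, but I would like to remove the header and footer of every page while extracting the PDF to text.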

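The similar questions I have found so far have not given me an answer. Is there a specific function that removes or extracts headers and footers? I think removing the first 7 and the last 7 lines of each page would also do the job.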

I hope someone can help me.

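from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_to_text(pdfname):
    # PDFMiner boilerplate
    rsrcmgr = PDFResourceManager()
    sio = StringIO()
    device = TextConverter(rsrcmgr, sio, codec='utf-8', laparams=LAParams(char_margin=20))
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Get text from the file, one page at a time
    with open(pdfname, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)

    # Get text from StringIO
    text = sio.getvalue()

    # Close objects
    device.close()
    sio.close()

    return text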
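
The fallback mentioned in the question, dropping the first 7 and last 7 lines of each page, could be sketched roughly as follows. This is only a sketch under assumptions: `strip_header_footer` is a hypothetical helper (not part of pdfminer), and it assumes the extracted string separates pages with a form feed character (`\f`), which is what `TextConverter` appears to write at the end of each page. It does not detect headers or footers; it only trims a fixed number of lines.

```python
def strip_header_footer(text, n_head=7, n_tail=7):
    """Drop the first n_head and last n_tail lines of every page.

    Hypothetical helper: assumes pages in `text` are separated by
    form feed characters ('\f').
    """
    kept = []
    for page in text.split("\f"):
        lines = page.splitlines()
        if not lines:
            # Skip empty chunks, e.g. the one after the final form feed
            continue
        # Index where the footer starts; never negative
        end = max(len(lines) - n_tail, 0)
        # Pages shorter than n_head + n_tail contribute nothing
        kept.extend(lines[n_head:end])
    return "\n".join(kept)
```

It could then be applied to the output of the function above, for example `strip_header_footer(pdf_to_text("report.pdf"))`, where `report.pdf` is a placeholder file name. Blank lines count as lines here, and the layout analysis usually emits blank lines between text boxes, so the right values for `n_head` and `n_tail` may differ from the number of visible header and footer lines.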

