e.i*_*luf 7 python pdf node.js pdf-scraping
我有数千个 pdf 文件需要从中提取数据。这是一个pdf示例。我想从示例 pdf 中提取此信息。
我对 Nodejs、Python 或任何其他有效的方法持开放态度。我对python和nodejs了解甚少。我尝试使用 python 与此代码
import PyPDF2
try:
pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageNumber = pdfReader.numPages
page = pdfReader.getPage(0)
print(pageNumber)
pagecontent = page.extractText()
print(pagecontent)
except Exception as e:
print(e)Run Code Online (Sandbox Code Playgroud)
但我陷入了如何查找采购历史记录的困境。从 pdf 中提取采购历史记录的最佳方法是什么?
小智 2
很久以前我就做过类似的事情来刷成绩。我发现的最简单(不太漂亮)的解决方案是将 pdf 转换为 html,然后解析 html。
为此,我使用了 pdf2text/pdf2html ( https://pypi.org/project/pdf-tools/ ) 和 html。
我也使用了编解码器,但不记得其背后的原因。
一个快速而肮脏的总结:
from lxml import html
import codecs
import os
# First convert the pdf to text/html
# You can skip this step if you already did it
os.system("pdf2txt -o file.html file.pdf")
# Open the file and read it
file = codecs.open("file.html", "r", "utf-8")
data = file.read()
# We know we're dealing with html, let's load it
html_file = html.fromstring(data)
# As it's an html object, we can use xpath to get the data we need
# In the following I get the text from <div><span>MY TEXT</span><div>
extracted_data = html_file.xpath('//div//span/text()')
# It returns an array of elements, let's process it
for elm in extracted_data:
# Do things
file.close()
Run Code Online (Sandbox Code Playgroud)
只需检查 pdf2text 或 pdf2html 的结果,然后使用 xpath 您应该可以轻松提取信息。
我希望它有帮助!
编辑:评论代码
EDIT2:以下代码正在打印您的数据
# Assuming you're only giving the page 4 of your document
# os.system("pdf2html test-page4.pdf > test-page4.html")
from lxml import html
import codecs
import os
file = codecs.open("test-page4.html", "r", "utf-8")
data = file.read()
html_file = html.fromstring(data)
# I updated xpath to your need
extracted_data = html_file.xpath('//div//p//span/text()')
for elm in extracted_data:
line_elements = elm.split()
# Just observed that what you need starts with a number
if len(line_elements) > 0 and line_elements[0].isdigit():
print(line_elements)
file.close();
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
30317 次 |
| 最近记录: |