Jam*_*ard 8 pdf python-3.x python-requests
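I am trying to build a PDF scraper for the Australian Securities Exchange (ASX) website that will let me search all the "announcements" made by a company and search the PDFs of those announcements for keywords.
So far I am using requests and PyPDF2 to fetch the PDF file, write it to my drive, and then read it back. However, I would like to skip writing the PDF to disk and reading it back, and go straight from fetching the PDF to converting it to a string. What I have so far is: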
import requests, PyPDF2

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)
my_raw_data = response.content

with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)

open_pdf_file = open("my_pdf.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
num_pages = read_pdf.getNumPages()

ann_text = []
for page_num in range(num_pages):
    if read_pdf.isEncrypted:
        read_pdf.decrypt("")
        print(read_pdf.getPage(page_num).extractText())
        page_text = read_pdf.getPage(page_num).extractText().split()
        ann_text.append(page_text)
    else:
        print(read_pdf.getPage(page_num).extractText())
print(ann_text)
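This prints a list of strings from the PDF file at the given URL.
I am just wondering whether I can convert the my_raw_data variable into a readable string?
Thanks very much in advance!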
Maa*_*bré 12
You can use io:
import requests, PyPDF2, io

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)

with io.BytesIO(response.content) as open_pdf_file:
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    num_pages = read_pdf.getNumPages()
    print(num_pages)
PS. Always use a context manager (a with statement) to open files.
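To see why the io.BytesIO approach can stand in for a file on disk, here is a minimal sketch using only the standard library. It assumes that a PDF parser mainly needs a binary stream it can read and seek, including seeking to the end. The `raw` bytes are a made-up stand-in for `response.content`, and `read_tail` is a hypothetical helper, not part of PyPDF2.

```python
import io

# Made-up stand-in for response.content; this is not a valid PDF,
# just bytes with a PDF-like header and end marker.
raw = b"%PDF-1.4\n(placeholder body)\n%%EOF\n"


def read_tail(stream, size):
    """Return the last `size` bytes of a seekable binary stream."""
    stream.seek(0, io.SEEK_END)          # jump to the end, as a parser might
    length = stream.tell()               # total number of bytes in the stream
    stream.seek(max(0, length - size))   # step back `size` bytes
    return stream.read()


# BytesIO wraps the bytes in an object with the same read/seek/tell
# interface as a file opened with open(path, 'rb'), without touching disk.
with io.BytesIO(raw) as buffer:
    header = buffer.read(8)
    tail = read_tail(buffer, 6)

print(header)
print(tail)
```

Because the buffer lives entirely in memory, it disappears once the with block exits, so any page text should be extracted while the block is still open.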
Try this (using the io module and an extra decryption step):
import requests, PyPDF2, io

url = 'http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf'
response = requests.get(url).content
reserve_pdf_on_memory = io.BytesIO(response)
load_pdf = PyPDF2.PdfFileReader(reserve_pdf_on_memory)

if load_pdf.isEncrypted:
    load_pdf.decrypt("")
    print(load_pdf.getPage(0).extractText())
else:
    print(load_pdf.getPage(0).extractText())
Good luck ... :)
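Since the stated goal in the question is to search announcements for keywords, here is one possible sketch of that last step. It assumes the text of each page has already been extracted into a list of strings, for example with extractText() as in the answers above. `find_keywords` and `sample_pages` are hypothetical names invented for illustration, and the sample text is made up rather than taken from a real announcement.

```python
import re


def find_keywords(page_texts, keywords):
    """Map each keyword to the 1-based page numbers where it occurs."""
    hits = {}
    for keyword in keywords:
        # Whole-word, case-insensitive match; re.escape keeps any
        # punctuation in the keyword from being read as regex syntax.
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        hits[keyword] = [
            number
            for number, text in enumerate(page_texts, start=1)
            if pattern.search(text)
        ]
    return hits


# Made-up page text standing in for the output of extractText().
sample_pages = [
    "Quarterly report: revenue increased and a dividend was declared.",
    "Notes to the accounts.",
    "The Dividend will be paid in December.",
]

result = find_keywords(sample_pages, ["dividend", "takeover"])
print(result)
```

Text extracted from real PDFs can contain unexpected line breaks or missing spaces, so a whole-word match like this may miss some occurrences. Normalizing whitespace first may help.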