Tag: pypdf2

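How to parse PDF pages online with Scrapy?

I am trying to scrape PDFs online using Scrapy together with the PyPDF2 library, so far without success. I can follow all the links and fetch the PDF files, but feeding them to PyPDF2 seems to be the problem.

Note: my goal is not to scrape/save the PDF files. I intend to parse them by first converting each PDF to text and then processing that text with other methods.

For brevity, I have not included the full code here. This is part of my code: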

import io
import re
import PyPDF2
import scrapy
from scrapy.item import Item

class ArticleSpider(scrapy.Spider):
    name = "spyder_ARTICLE"                                                 
    start_urls = ['https://legion-216909.appspot.com/content.htm']                                                                      

    def parse(self, response):                                              
        for article_url in response.xpath('//div//a/@href').extract():      
            yield response.follow(article_url, callback=self.parse_pdf) 

    def parse_pdf(self, response):
        """ Peek inside PDF to check for targets.
        @return: PDF content as searchable plain-text string
        """
        reader = PyPDF2.PdfFileReader(response.body)
        text = u""

        # Title is optional, may be None
        if …
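
One thing that may be worth checking in the excerpt above: `response.body` is a `bytes` object, and the spider follows every link under a `div`, so some responses might not be PDFs at all. The sketch below is only an illustration under those assumptions. The helper name `body_to_pdf_stream` and the 1024-byte header window are hypothetical choices, not part of Scrapy or PyPDF2.

```python
import io

PDF_MAGIC = b"%PDF-"  # marker that PDF files carry near the start


def body_to_pdf_stream(body):
    """Return a seekable stream for a PDF body, or None if it is not a PDF.

    Hypothetical helper: it only looks for the PDF marker in the first
    1024 bytes and wraps the raw bytes in an in-memory file object.
    """
    if PDF_MAGIC not in body[:1024]:
        return None
    return io.BytesIO(body)


# Inside parse_pdf this could be used roughly as (not executed here):
#   stream = body_to_pdf_stream(response.body)
#   if stream is not None:
#       reader = PyPDF2.PdfFileReader(stream)
```

Returning `None` for non-PDF bodies lets the callback skip HTML pages instead of raising inside the PDF reader.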

python scrapy pypdf2

Cod*_*key

2018 09-26

Score: 2
Answers: 1
Views: 1861

"ImportError: cannot import name 'convert_from_path' from partially initialized module 'pdf2image' (most likely due to a circular import)"

I get this error when using the pdf2image module:

from pdf2image import convert_from_path
pages = convert_from_path('mypdf', 500)
for page in pages:
    page.save('out.jpg', 'JPEG')
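
A frequent cause of a "partially initialized module" error is a local file or folder that has the same name as the installed package, so that Python imports the local one instead. Whether that applies here is an assumption, since the question does not say how the script is named. A minimal sketch of a check, with the hypothetical helper `find_shadowing_paths`:

```python
from pathlib import Path


def find_shadowing_paths(module_name, directory="."):
    """List local files or folders that would shadow an installed package.

    Hypothetical helper: it only looks in one directory, typically the
    one the script is run from.
    """
    base = Path(directory)
    candidates = [base / f"{module_name}.py", base / module_name]
    return [path for path in candidates if path.exists()]


# Example: check the current directory for a stray pdf2image.py
shadowing = find_shadowing_paths("pdf2image")
```

If the list is not empty, renaming the local file (and removing any stale `__pycache__` entry for it) would be the first thing to try.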

image python-3.x pypdf2

Dhu*_*ena

Score: 2
Answers: 1
Views: 4297

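How to get PDF orientation with PyPDF2

I am using Python/Django, with PyPDF2 to read my current PDF.

I want to read my saved PDF and get the orientation of a single page in it.

I want to be able to determine whether the page is landscape or portrait.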

tempoutpdffilelocation =  settings.TEMPLATES_ROOT + nameOfFinalPdf
pageOrientation = pageToEdit.mediaBox
pdfOrientation = PdfFileReader(file(temppdffilelocation, "rb"))
# tempPdfOrientationPage = pdfOrientation.getPage(numberOfPageToEdit).mediaBox
print("existing pdf width: ")
# print(existing_pdf.getPage(numberOfPageToEdit).getWidth)
# print("get page size with rotation")
# print(tempPdfOrientationPage.getPageSizeWithRotation) 

existing_pdf = pdfOrientation.getPage(numberOfPageToEdit).mediaBox
# print(pageOrientation)
if pageOrientation.getUpperRight_x() - pageOrientation.getUpperLeft_x() > pageOrientation.getUpperRight_y() - pageOrientation.getLowerRight_y():
  print('Landscape')
  print(pageOrientation)
  # print(pdfOrientation.getWidth())
else:
  print('Portrait')
  print(pageOrientation)
  # print(pdfOrientation.getWidth())
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)

The last line sets pagesize=letter, which is what I would like to determine from my current PDF instead.
Here are my imports: …
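
One way to decide the orientation is to compare the page width and height, swapping them when the page is rotated by 90 or 270 degrees. The sketch below is only an illustration. `page_orientation` is a hypothetical helper, and it assumes the width and height come from the page's media box (as in the excerpt above) and the rotation from the page's `/Rotate` entry.

```python
def page_orientation(width, height, rotation=0):
    """Classify a page as 'landscape' or 'portrait'.

    width and height are in PDF points; rotation is in degrees.
    A square page is reported as 'portrait' (an arbitrary choice).
    """
    if rotation % 180 == 90:
        # A quarter-turn swaps the visible width and height.
        width, height = height, width
    return "landscape" if width > height else "portrait"


# US Letter is 612 x 792 points.
orientation = page_orientation(612, 792)
```

ReportLab accepts `pagesize` as a `(width, height)` tuple in points, so the same two numbers could be passed to `canvas.Canvas(packet, pagesize=(width, height))` instead of `letter`.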

python pdf django reportlab pypdf2

Jon*_*edy

Score: 1
Answers: 1
Views: 5745

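Parsing a PDF from the web in Python 3

I am trying to fetch a PDF from a web page, parse it, and print the result to the screen using PyPDF2. I got it working with the following code: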

with open("foo.pdf", "wb") as f:
    f.write(requests.get(buildurl(jornal, date, page)).content)
pdfFileObj = open('foo.pdf', "rb")
pdf_reader = PyPDF2.PdfFileReader(pdfFileObj)
page_obj = pdf_reader.getPage(0)
print(page_obj.extractText())

Writing a file just so that I can read it back sounds wasteful, though, so I figured I would cut out the middleman like this:

pdf_reader = PyPDF2.PdfFileReader(requests.get(buildurl(jornal, date, page)).content)
page_obj = pdf_reader.getPage(0)
print(page_obj.extractText())

However, this gives me an AttributeError: 'bytes' object has no attribute 'seek'. How can I feed the PDF from requests directly into PyPDF2?
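
The error message suggests that the reader wants a file-like object it can `seek` in, while `.content` is a plain `bytes` value. A minimal sketch of the usual workaround is to wrap the bytes in `io.BytesIO`. The helper name `as_pdf_stream` and the sample bytes are made up for illustration.

```python
import io


def as_pdf_stream(pdf_bytes):
    """Wrap raw PDF bytes in an in-memory, seekable file object."""
    return io.BytesIO(pdf_bytes)


# Placeholder bytes standing in for requests.get(...).content
stream = as_pdf_stream(b"%PDF-1.4 example bytes")

# With the names from the question this would look roughly like
# (not executed here):
#   content = requests.get(buildurl(jornal, date, page)).content
#   pdf_reader = PyPDF2.PdfFileReader(as_pdf_stream(content))
```

Nothing is written to disk; the whole PDF stays in memory, which is fine for small files.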

python pdf python-requests pypdf2

Ber*_*rer

2016 07-31

Score: 1
Answers: 1
Views: 1558

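Writing text over a PDF in Python 3

I am trying to write some strings into a PDF file at a certain position. I found a way to do it and implemented it like this: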

from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

packet = io.StringIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.drawString(10, 100, "Hello world")
can.save()

#move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(file("original.pdf", "rb"))
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(0)
page.mergePage(new_pdf.getPage(0))
output.addPage(page)
# finally, write …
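
Two details in the excerpt above look like Python 2 leftovers, assuming it is run on Python 3 as the title says: PDF data is binary, so the buffer would need to be `io.BytesIO` rather than `io.StringIO`, and the `file()` builtin no longer exists (`open("original.pdf", "rb")` is the replacement). The sketch below only demonstrates the buffer difference with the standard library; the sample bytes are made up.

```python
import io

pdf_bytes = b"%PDF-1.3\n"  # stand-in for the bytes a PDF writer produces

# A text buffer rejects binary data on Python 3.
text_buffer = io.StringIO()
try:
    text_buffer.write(pdf_bytes)
    string_io_accepts_bytes = True
except TypeError:
    string_io_accepts_bytes = False

# A bytes buffer accepts it and can be rewound for reading.
packet = io.BytesIO()
packet.write(pdf_bytes)
packet.seek(0)
```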

python pdf canvas python-3.x pypdf2

waq*_*ard

2018 10-31

Score: 1
Answers: 1
Views: 2261

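PyPDF2 PdfFileMerger loses the PDF module in the merged file

I am merging PDF files with PyPDF2. However, when one of the files contains a PDF module (a fillable form) filled with data, as is typical of application-filled PDFs, the module is empty in the merged file and shows no data.

Here are the two methods I use to merge the PDFs: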

def merge_pdf_files(pdf_files, i):
    pdf_merger = PdfFileMerger(strict=False)
    for pdf in pdf_files:
        pdf_merger.append(pdf)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    pdf_merger.write(output_filename)

def merge_pdf_files2(pdf_files, i):
    output = PdfFileWriter()
    for pdf in pdf_files:
        input = PdfFileReader(pdf)
        for page in input.pages:
            output.addPage(page)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    with open(output_filename,'wb') as output_stream:
        output.write(output_stream)

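I would like the final merged PDF to show all the data filled into the PDF module. Alternatively, can someone point me to another Python library that does not suffer from this (apparent) bug? Thanks.

Update: I also tried PyMuPDF, with the same result.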

def merge_pdf_files4(pdf_files, i):
    output = fitz.open()
    for pdf in pdf_files:
        input = fitz.open(pdf)
        output.insertPDF(input) …

python pdf pdfa pypdf2

A_E*_*A_E

2019 07-16

Score: 1
Answers: 1
Views: 872

How to extract text from a PDF in Python 3.7.3

I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an Excel file, so that I can easily record monthly spending. Right now I am focusing just on extracting the text from the PDF file, but I don't know how to do so.

What is currently the best and easiest way to extract text from a PDF file into a …

python pdf pdf-extraction pypdf2

RaV*_*LLi

2019 04-20

Score: -1
Answers: 3
Views: 5849