Joe*_*nin 5 python pdf string unicode pypdf
I'm trying to extract each page of a PDF as a string:
import pyPdf

pages = []
pdf = pyPdf.PdfFileReader(file('g-reg-101.pdf', 'rb'))
for i in range(0, pdf.getNumPages()):
    this_page = pdf.getPage(i).extractText() + "\n"
    this_page = " ".join(this_page.replace(u"\xa0", " ").strip().split())
    pages.append(this_page.encode("ascii", "xmlcharrefreplace"))
for page in pages:
    print '*' * 80
    print page
But this script ignores newline characters, leaving me with messy strings like "information concerning an individual which, because of name, identifyingnumber, mark or description" (i.e., this should read "identifying number", not "identifyingnumber").
I don't know much about PDF encoding, but I think you can solve your particular problem by modifying pdf.py. In the PageObject.extractText method, you can see what's going on:
def extractText(self):
    [...]
    for operands,operator in content.operations:
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == '"':
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
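If the operator is Tj or TJ (it's Tj in your example PDF), the text is simply appended and no newline is added. You wouldn't necessarily want a newline there, at least if I'm reading the PDF reference right: Tj/TJ are just the single and multiple show-string operators, and a separator of some kind between them isn't mandatory.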
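To make the point about Tj concrete, here is a minimal sketch. The content-stream fragment is hypothetical (it is not taken from the question's PDF), and the regular expression is a deliberately naive stand-in for a real content-stream parser. It shows one plausible way a visual line break can exist on the page without any newline character: the break is produced by a Td positioning operator between two Tj operators, so concatenating the shown strings runs the words together.

```python
import re

# Hypothetical content-stream fragment (an assumption for illustration).
# The second Td moves the text position down one line; no "\n" is stored anywhere.
stream = "BT /F1 12 Tf 72 700 Td (identifying) Tj 0 -14 Td (number, mark) Tj ET"

# Naive pattern: a literal string without nested parentheses or escapes,
# immediately followed by the Tj operator. Real PDFs need a proper parser.
shown = re.findall(r"\(([^()\\]*)\)\s*Tj", stream)

print(shown)           # the two strings handed to Tj
print("".join(shown))  # plain concatenation, as in the Tj branch above
```

Joining the two strings without a separator gives the same kind of run-together text reported in the question.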
Anyway, if you modify this code to look something like this
def extractText(self, Tj_sep="", TJ_sep=""):
    [...]
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += Tj_sep
                text += _text
    [...]
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += TJ_sep
                    text += i
then the default behaviour should be the same:
In [1]: pdf.getPage(1).extractText()[1120:1250]
Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv'
but you can change it when you want to:
In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250]
Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily '
or
In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250]
Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily '
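Alternatively, you could add the separators yourself by modifying the operands themselves in place, but that could break something else (methods like get_original_bytes make me nervous).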
Finally, you don't have to edit pdf.py itself if you don't want to: you could simply pull this method out into a function.
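As a rough sketch of what such a standalone function might look like: the version below takes a list of (operands, operator) pairs shaped like content.operations in the extractText excerpt and applies the same Tj_sep/TJ_sep idea. The function name join_show_text and the text_type parameter are made up for this illustration; text_type stands in for pyPdf's TextStringObject so the sketch runs without pyPdf installed. With pyPdf you would pass the real operations list and TextStringObject instead.

```python
def join_show_text(operations, Tj_sep="", TJ_sep="", text_type=str):
    """Concatenate the strings shown by PDF text operators.

    operations: iterable of (operands, operator) pairs.
    text_type:  stand-in for pyPdf's TextStringObject (an assumption).
    """
    parts = []
    for operands, operator in operations:
        if operator == "Tj":
            if isinstance(operands[0], text_type):
                parts.append(Tj_sep)          # separator goes before each Tj string
                parts.append(operands[0])
        elif operator == "T*":
            parts.append("\n")
        elif operator == "'":
            parts.append("\n")
            if isinstance(operands[0], text_type):
                parts.append(operands[0])
        elif operator == '"':
            if isinstance(operands[2], text_type):
                parts.append("\n")
                parts.append(operands[2])
        elif operator == "TJ":
            for item in operands[0]:          # array mixing strings and kerning numbers
                if isinstance(item, text_type):
                    parts.append(TJ_sep)
                    parts.append(item)
    return "".join(parts)


# Hypothetical operations, not read from the question's PDF.
ops = [
    (["because of name, identifying"], "Tj"),
    (["number, mark or description"], "Tj"),
]
print(join_show_text(ops))              # default: strings run together
print(join_show_text(ops, Tj_sep=" "))  # a space is inserted before each Tj string
```

Because the separator is added before every string, the result starts with the separator when the first operator is Tj; strip it afterwards if that matters.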