pyPdf ignores newlines in PDF file

Joe*_*nin 5 python pdf string unicode pypdf

I'm trying to extract each page of a PDF as a string:

import pyPdf

pages = []
pdf = pyPdf.PdfFileReader(file('g-reg-101.pdf', 'rb'))
for i in range(0, pdf.getNumPages()):
    this_page = pdf.getPage(i).extractText() + "\n"
    this_page = " ".join(this_page.replace(u"\xa0", " ").strip().split())
    pages.append(this_page.encode("ascii", "xmlcharrefreplace"))
for page in pages:
    print '*' * 80
    print page

But this script ignores newlines, leaving me with garbled strings like information concerning an individual which, because of name, identifyingnumber, mark or description (for example, this should read identifying number, not identifyingnumber).

Here is an example of the type of PDF I'm trying to parse.

DSM*_*DSM 9

I don't know much about PDF encoding, but I think you can solve your particular problem by modifying pdf.py. In the PageObject.extractText method, you can see what's going on:

def extractText(self):
    [...]
    for operands,operator in content.operations:
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == '"':
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i

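If the operator is Tj or TJ (it's Tj in your example PDF), the text is simply appended and no newline is added. You wouldn't necessarily want a newline added, at least if I'm reading the PDF reference right: Tj and TJ are simply the single and multiple show-string operators, and a separator of some kind isn't mandatory.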

Anyway, if you modify this code to something like

def extractText(self, Tj_sep="", TJ_sep=""):

[...]

        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += Tj_sep
                text += _text

[...]

        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += TJ_sep
                    text += i

then the default behaviour should be the same:

In [1]: pdf.getPage(1).extractText()[1120:1250]
Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv'

but you can change it when you want to:

In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250]
Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily '

or

In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250]
Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily '

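Alternatively, you could add the separators yourself by modifying the operands in place, but that could break something else (methods like get_original_bytes make me nervous).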

Finally, you don't have to edit pdf.py itself if you don't want to: you could simply pull this method out into a function.
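As a rough sketch of that last idea, the operator loop can live in a standalone function that takes the (operands, operator) pairs as an argument. The names join_text_operations, text_type and sample_ops are made up for illustration and are not part of pyPdf. With pyPdf you would presumably pass content.operations and TextStringObject, building the content stream the same way extractText does. The sample data here is hand-written plain strings, so the block runs without a PDF.

```python
def join_text_operations(operations, Tj_sep="", TJ_sep="", text_type=str):
    """Rebuild text from (operands, operator) pairs.

    Mirrors the operator handling of the modified extractText shown above.
    `text_type` is a stand-in for pyPdf's TextStringObject (an assumption
    made so the sketch runs without pyPdf installed).
    """
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            # Single show-string: optionally prefix a separator.
            if isinstance(operands[0], text_type):
                text += Tj_sep + operands[0]
        elif operator == "T*":
            # Move to the start of the next line.
            text += "\n"
        elif operator == "'":
            # Next line, then show string.
            text += "\n"
            if isinstance(operands[0], text_type):
                text += operands[0]
        elif operator == '"':
            # Set spacing, next line, show string (string is operand 2).
            if isinstance(operands[2], text_type):
                text += "\n" + operands[2]
        elif operator == "TJ":
            # Array of strings and kerning numbers; only strings are kept.
            for item in operands[0]:
                if isinstance(item, text_type):
                    text += TJ_sep + item
    return text


# Hand-written stand-in for content.operations (not real pyPdf output).
sample_ops = [
    (["because of name, identifying"], "Tj"),
    (["number, mark or description"], "Tj"),
]

default_text = join_text_operations(sample_ops)
spaced_text = join_text_operations(sample_ops, Tj_sep=" ")
```

Note that the separator is prepended to every string, so the result starts with a separator when the first operator is Tj. Call .strip() on the result if that matters.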