Jok*_*ker 4 python pdf text python-3.x pdfminer
I want to extract the text from this PDF: https://github.com/pdfminer/pdfminer.six/files/1887670/Wochenkarte-KW-15-Neu.pdf

When I extract the text with the following code:

    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage


    def convert_pdf_to_txt(path):
        resource_manager = PDFResourceManager()
        device = None
        try:
            with StringIO() as string_writer, open(path, 'rb') as pdf_file:
                device = TextConverter(resource_manager, string_writer, codec='utf-8', laparams=LAParams())
                interpreter = PDFPageInterpreter(resource_manager, device)

                # Only the first page is processed.
                for page in PDFPage.get_pages(pdf_file, maxpages=1):
                    interpreter.process_page(page)

                pdf_text = string_writer.getvalue()
        finally:
            if device:
                device.close()
        return pdf_text

The extracted text does not match the text layout of the PDF.
Current result:

    Montag 09.04.2018
    Menü 1

    Kl. Salat


    Menü 2

    Kl. Salat

    Seelachs-Spinat-Türmchen mit Spinat-
    Masalla-Sauce und Reis
    Currywurst mit Pommes

Expected result:

    Montag 09.04.2018
    Menü 1

    Kl. Salat Seelachs-Spinat-Türmchen mit Spinat-Masalla-Sauce und Reis

    Menü 2

    Kl. Salat Currywurst mit Pommes

What am I doing wrong, or what am I missing?

The key is to pass a different line margin to LAParams:

    LAParams(line_margin=0.1)

My line now looks like this:

    device = TextConverter(resource_manager, string_writer, codec='utf-8', laparams=LAParams(line_margin=0.1))

Credit to Tim.
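
To make it easier to see why a smaller line_margin changes the result, here is a rough sketch of the idea. As far as I understand the pdfminer.six documentation, line_margin is relative to the height of a line (the default is 0.5), and lines that are vertically closer than that margin are merged into one text box. The code below is a simplified toy model, not pdfminer's actual implementation: the `Line` class, the `group_lines` helper, and the coordinates are all made up for illustration, and the real layout analysis also checks horizontal overlap and alignment.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical text line; only the vertical extent of its box is modelled."""
    text: str
    y0: float  # bottom edge (PDF coordinates, y grows upward)
    y1: float  # top edge

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def group_lines(lines, line_margin):
    """Toy grouping: a line joins the current box when the vertical gap to the
    previous line is at most line_margin * (height of the previous line)."""
    boxes = []
    # Walk the lines from the top of the page to the bottom.
    for line in sorted(lines, key=lambda ln: -ln.y1):
        if boxes:
            previous = boxes[-1][-1]
            gap = previous.y0 - line.y1
            if gap <= line_margin * previous.height:
                boxes[-1].append(line)
                continue
        boxes.append([line])
    return [[ln.text for ln in box] for box in boxes]


# Made-up coordinates: three lines of height 10 with gaps of 2 and 4 points.
sample_lines = [
    Line("Seelachs-Spinat-Türmchen mit Spinat-", 100, 110),
    Line("Masalla-Sauce und Reis", 88, 98),
    Line("Currywurst mit Pommes", 74, 84),
]

print(len(group_lines(sample_lines, 0.5)))  # 1
print(len(group_lines(sample_lines, 0.1)))  # 3
```

With the larger margin the two menu entries end up in a single block, which resembles the "current result" in the question. With the smaller margin every line stays in its own box, so the layout analysis is free to order them differently. How the real library then orders those boxes is outside this sketch.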