Problem converting larger HTML files to docx with pypandoc (pandoc)

Question

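My question is related to "How to increase heap memory in pandoc execution?", but adds a Python-specific component.

Background: I am trying to automate the generation of a scientific report. I have written the data to an HTML file and want to use Pandoc.exe (a file conversion program) to convert it to a .docx Word document. I already have the process working for a smaller HTML file with images, tables, and so on. That file is 307KB.

The problems start when I try to convert a larger file (~4.5MB) with several embedded figures. I have been converting with pypandoc as follows: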

import os
import pypandoc

PANDOC_PATH = r"C:\Program Files\RStudio\bin\pandoc"

# savepath and name are assumed to be defined earlier in the script
infile = savepath + os.sep + 'Results ' + name + '.html'
outfile = savepath + os.sep + 'Results ' + name + '.docx'

output = pypandoc.convert(source=infile, format='html', to='docx',
                          outputfile=outfile, extra_args=["+RTS", "-K64m", "-RTS"])

But I run into a variety of errors. Typically:

RuntimeError: Pandoc died with exitcode "2" during conversion: 
b"Stack space overflow: current size 33692 bytes.\nUse `+RTS -Ksize -RTS' to increase it.\n"

Or, if I raise the -Ksize value to 256m, this:

RuntimeError: Pandoc died with exitcode "1" during conversion: b'pandoc: out of memory\r\n'

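Can someone explain what is going on here, and how I might get around this difficulty? One solution I have considered is making my images much smaller. So far I have only scaled down the original images (80 - 500KB) like this, where each image's width and height depend on its original dimensions: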

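import base64

# Read the image file and encode it as base64 text for a data URI
with open(formats[graph][0], 'rb') as image_file:
    data_uri = base64.b64encode(image_file.read()).decode('utf-8')

height, width = formats[graph][2][0], formats[graph][2][1]
img_tag = '<img src="data:image/jpg;base64,{0}" height={1} width={2}>'.format(data_uri, height, width)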
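
A small sketch of why the height and width attributes alone may not reduce the load on pandoc: they change how the image is displayed, not how many bytes are embedded in the HTML. Base64 encoding also adds roughly a third on top of the file size. The helper data_uri_length and the zero-filled payload are hypothetical stand-ins for illustration, not part of the original script.

```python
import base64


def data_uri_length(raw_bytes, mime="image/jpeg"):
    """Return the number of characters in a base64 data URI for raw_bytes."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return len("data:{0};base64,{1}".format(mime, encoded))


# Stand-in for a 300 KB image file (assumption: real files behave the same,
# since base64 output length depends only on the input length).
payload = bytes(300000)
encoded_length = len(base64.b64encode(payload))

# Every 3 input bytes become 4 output characters.
print(encoded_length)
print(data_uri_length(payload))
```

Actually shrinking the embedded data would mean resampling the image files themselves before encoding them, for example with an imaging library.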

Thanks for your help.

Answer 1

Hal*_*kal 5

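Many thanks to user2407038 for the help with this!

Two fixes ultimately allowed me to convert the larger HTML files to docx with pypandoc:

The first, as suggested, was to

increase the maximum heap size, for example by adding -M2GB to extra_args

That is: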

output = pypandoc.convert(source=infile, format='html', to='docx', outputfile=outfile, extra_args=["-M2GB", "+RTS", "-K64m", "-RTS"])
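
A minimal sketch of how the runtime flags could be grouped, assuming the pandoc binary accepts GHC runtime (RTS) options, which the "+RTS -Ksize -RTS" hint in the error message suggests. As far as I know, -M only sets the maximum heap size when it sits between +RTS and -RTS; outside that pair pandoc reads -M as its own --metadata option. The helper pandoc_rts_args and its default sizes are hypothetical, not part of pypandoc.

```python
def pandoc_rts_args(stack_size="64m", max_heap="2048m"):
    """Build an extra_args list that passes GHC runtime options to pandoc.

    -K sets the maximum stack size and -M the maximum heap size.
    Both are kept between +RTS and -RTS so the runtime, not pandoc's
    own option parser, receives them.
    """
    return ["+RTS", "-K" + stack_size, "-M" + max_heap, "-RTS"]


extra_args = pandoc_rts_args()
print(extra_args)
```

The resulting list can be passed as the extra_args argument of the conversion call shown above.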

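Even after increasing the heap size I still hit a second problem, so I am not sure whether that fix actually worked. Python returned an error message like this:

RuntimeError: Pandoc died with exitcode "1" during conversion: b"pandoc: Cannot decode byte '\x91': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream\n"

This was solved by first changing how the HTML file is opened. Setting the encoding keyword argument to 'utf8' allowed the conversion to work: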

report = open(savepath + os.sep + 'Results ' + name + '.html', 'w', encoding='utf8')
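
A short sketch of what was probably going on, assuming the file had been written with the Windows default encoding (cp1252): in that encoding a left curly quote is the single byte 0x91, which is not valid UTF-8, while pandoc expects UTF-8 input. The sample text and the temporary file path are made up for illustration.

```python
import os
import tempfile

text = "\u2018quoted\u2019"  # curly quotes, a common source of byte 0x91

# Encoded as cp1252, the left curly quote becomes the single byte 0x91.
cp1252_bytes = text.encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Writing with an explicit encoding produces UTF-8 bytes on every platform.
path = os.path.join(tempfile.mkdtemp(), "report.html")
with open(path, "w", encoding="utf-8") as report:
    report.write("<p>" + text + "</p>")

with open(path, "rb") as raw:
    utf8_bytes = raw.read()

print(cp1252_bytes[:1], decodes_as_utf8)
print(utf8_bytes)
```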
