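Converting a .docx file to plain text with line breaks preserved, so that line-number references match the source document: how, and what are the implications?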

Question

9 scripting conversion text microsoft-word

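I am exporting MS Word content to plain text for use with text and file utilities. I have one constraint: line numbering is enabled in the MS software, and any reference to a line number in the final output must match that numbering. So the input is "numbered lines":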

[Image: Word document with line numbering enabled (Poe, E. A.)]

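Apparently, in Word this numbering does not break at newline characters but at the right margin (or something similar). A script like docx2txt does not take this into account by default; it seems to break lines at newline characters. So if I number the lines with grep -n, they will not match the source's line-numbering feature shown above. It is not clear from the documentation how I would need to edit the Perl script to convert the file the way I need in this case: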

our $config_newLine = "\n"; # Alternative is "\r\n".
our $config_lineWidth = 80; # Line width, used for short line justification.

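I tried substituting \r\n for \n, but that did not seem to work for me. So I exported the document directly from Word (Save as plain text, on v.2013, 64pc) with the following settings: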

Unicode (UTF-8)
Insert line breaks + End lines with (CR/LF)
Allow character substitution

Now, when I use these .txt files, the line numbers from the source numbering feature and those in the grep -n output match perfectly.

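Is there a specific configuration or procedure I should know about for docx2txt, or a similar command-line utility, that would let me convert my .docx files to plain text with line breaks preserved, without resorting to Word as I did?
With regard to line breaks and formatting, what are the best practices (if any) for exporting MS Word documents (which may contain accented characters) to plain text for use with file/text utilities? Do the export settings I chose (i.e. inserting CR/LF) have any drawbacks?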

Sample

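As suggested, I am providing a sample. In this rar archive I have bundled a .docx file with a simple paragraph, along with the .txt file exported from it using Word and the options above. The latter can be compared with the output of a default docx2txt run on the source file.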

Answer 1

Ant*_*hon 8

docx2txt works on the information in the .docx file, which is a zipped set of XML files.

With regard to line wrapping, the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size and page width. docx2txt normally just tries to fit text in 80 columns (the width is configurable), without any regard for font and font size. If your .docx contains font information from a Windows system that is not available on Unix/Linux, then doing the export to .txt via Open/LibreOffice would also be unlikely to result in the same layout, although it tries to do a good job¹.
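To see what the file actually stores, here is a minimal Python sketch (standard library only) that pulls paragraphs and explicit breaks out of `word/document.xml`. It is a simplification, not how docx2txt is implemented: it ignores tabs, tables, headers and the like, and `docx_paragraphs` and `make_demo_docx` are names made up for this illustration. The point is that nothing in the markup says where Word wrapped a line on screen.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return one string per <w:p>; explicit <w:br/> breaks become '\n'.

    `source` is a path or a binary file-like object holding a .docx.
    Soft (rendered) line wraps cannot be recovered here because the
    XML does not record them.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        parts = []
        for node in p.iter():
            if node.tag == f"{{{W}}}t":      # a run of text
                parts.append(node.text or "")
            elif node.tag == f"{{{W}}}br":   # a hard break typed by the author
                parts.append("\n")
        paragraphs.append("".join(parts))
    return paragraphs


def make_demo_docx():
    """Hypothetical helper: build a tiny in-memory archive for the demo."""
    xml = (
        f'<w:document xmlns:w="{W}"><w:body>'
        '<w:p><w:r><w:t>first line</w:t><w:br/>'
        '<w:t>second line</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for number, text in enumerate(docx_paragraphs(make_demo_docx()), start=1):
        print(number, repr(text))
```

Running it on a real file (pass the path instead of the demo buffer) gives you paragraph numbers, which is all a converter has to work with before it applies its own wrapping.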

So docx2txt, or any other command-line utility, including command-line driven Open/LibreOffice processing, is not guaranteed to convert the text to the same layout as exporting from Word does².
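A toy illustration of why the numbering drifts: the same paragraph wrapped at two different widths puts a given word on different lines. This uses Python's `textwrap` with a plain character count, which is only a stand-in for a column-based wrapper; Word wraps by rendered font metrics, which this sketch does not model. `line_of` and the sample paragraph are invented for the example.

```python
import textwrap


def line_of(word, text, width):
    """Return the 1-based line on which `word` first appears after wrapping."""
    for number, line in enumerate(textwrap.wrap(text, width=width), start=1):
        if word in line.split():
            return number
    return None


# Forty dummy words standing in for one long paragraph.
paragraph = " ".join(f"word{i}" for i in range(1, 41))

if __name__ == "__main__":
    for width in (80, 50):
        print(width, line_of("word30", paragraph, width))
```

Any tool that picks its own width (or its own font) is effectively choosing one of these layouts, and only by luck the one Word chose.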

If you want (or are forced by client requirements) to render exactly as Word does, there is in my experience only one way: let Word do the rendering. When faced with a problem similar to yours³, and having gotten incompatible results from other tools, including OpenOffice, I resorted to installing a Windows VM on the host Linux server. On the client VM, a program watches for incoming files to be converted on the host, then starts and drives Word to do the conversion and copies back the result⁴.

Decisions about using CR/LF or LF only, and UTF-8 or some other encoding, for the .txt largely depend on how the resulting files are used. If the resulting files are used on Windows I would definitely go with CR/LF, UTF-8 and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8, but will not barf on the BOM and/or will use that information. You should test all your target applications for compatibility if those are known up front.
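If you post-process the exported text with a script, the CR/LF plus BOM combination is easy to produce and to read back. A small Python sketch, assuming you already have the lines in memory; `write_windows_text` and `read_numbered` are illustrative names, not part of any tool mentioned here:

```python
import os
import tempfile


def write_windows_text(path, lines):
    """Write lines as UTF-8 with a BOM and CR/LF line endings."""
    # "utf-8-sig" prepends the BOM (EF BB BF); newline="\r\n" makes
    # Python translate every "\n" written into CR/LF.
    with open(path, "w", encoding="utf-8-sig", newline="\r\n") as handle:
        for line in lines:
            handle.write(line + "\n")


def read_numbered(path):
    """Return (line_number, text) pairs, much like grep -n would number them."""
    # "utf-8-sig" drops the BOM if present; the default newline handling
    # turns CR/LF back into "\n" on read.
    with open(path, encoding="utf-8-sig") as handle:
        return [(n, line.rstrip("\n")) for n, line in enumerate(handle, start=1)]


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "export.txt")
    write_windows_text(target, ["café", "naïve"])
    print(read_numbered(target))
```

Reading with `utf-8-sig` works whether or not the BOM is there, which is handy when the same script has to consume files from both Word and Unix tools.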

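¹ This incompatibility is the main reason some of my friends cannot switch from Windows to Linux, although they would like to. They have to use Microsoft Word because Open/LibreOffice every now and then mangles the texts they exchange with their clients.
² You could install all the fonts used in the Word file and might get lucky with some texts at some point.
³ From .doc/.docx
⁴ The program uses GUI automation, as if someone were clicking through its menus, and does not try to drive Word via an API. I am pretty sure the latter can be done as well, and it would have the advantage of not breaking things if Word is upgraded.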

Archived:	11 years, 4 months ago
Viewed:	10175 times
Last active:	11 years, 2 months ago