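Converting a .docx file to plain text while preserving line breaks, to keep line-number references to the source document: how, and with what implications?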

9 scripting conversion text microsoft-word

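I am exporting MS Word content to plain text for use with text and file utilities. I have one constraint: the line numbering feature is enabled in the MS software, and any reference to a line number in the final output must match that numbering. So the input has "numbered lines":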

[Image: sample text with Word line numbering enabled (Poe, E.A.)]

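Obviously, in Word this numbering does not break at newline characters; lines wrap at the right margin (or something similar). Scripts like docx2txt do not take this into account by default and seem to break lines only at newline characters. So if I number lines with grep -n, they will not match the source line numbering feature shown above. It is not clear to me from the documentation how I need to edit the Perl script so that it converts the file the way I need in this case: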

our $config_newLine = "\n"; # Alternative is "\r\n".
our $config_lineWidth = 80; # Line width, used for short line justification.

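I tried substituting \r\n for \n, but that did not seem to work for me. So I exported the document directly from Word (Save As plain text, in v.2013, 64pc) with the following settings: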

  • Unicode (UTF-8)
  • Insert line breaks + End lines with (CR/LF)
  • Allow character substitution

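Now, when I use these .txt files, there is indeed a perfect match between the source numbering feature and the line numbers in the grep -n output.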
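The check described above can be sketched in Python as well. This is only an illustration, assuming the export is UTF-8 with CR/LF line ends and possibly a BOM; the function names `numbered_lines` and `find` and the sample text are made up for the example and are not part of any tool mentioned here.

```python
import os
import tempfile

def numbered_lines(path):
    """Yield (line_number, text) pairs, 1-based like `grep -n`."""
    # "utf-8-sig" drops a leading BOM if there is one; newline=None makes
    # Python treat CR/LF, LF and lone CR as line ends.
    with open(path, encoding="utf-8-sig", newline=None) as fh:
        for number, line in enumerate(fh, start=1):
            yield number, line.rstrip("\n")

def find(path, needle):
    """Return the numbered lines that contain `needle`."""
    return [(n, text) for n, text in numbered_lines(path) if needle in text]

# Stand-in for a Word export: UTF-8 with BOM, CR/LF line ends.
sample = "first line\r\nsecond line\r\nthird line\r\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "export.txt")
    with open(path, "w", encoding="utf-8-sig", newline="") as fh:
        fh.write(sample)
    hits = find(path, "third")

print(hits)
```

The numbers only agree with Word's numbering because Word already inserted a line break at every soft wrap during the export.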


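  • Is there a specific configuration or procedure I should know about, for docx2txt or a similar command-line utility, that would let me convert my .docx files to plain text while preserving the line breaks, without resorting to Word as I did?
  • With regard to line breaks and formatting, what is the best practice (if any) for exporting MS Word documents (which may contain accented characters) to plain text for use with file/text utilities? Do the settings I chose for the export (i.e. inserting CR/LF) have any negative implications?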

Sample

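As suggested, I am providing a sample. In this rar archive I have bundled a .docx file containing a simple paragraph, together with the .txt file exported from it using Word and the options above. The latter can be compared with a default run of docx2txt on the source file.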

Ant*_*hon 8

docx2txt works on the information in the docx file which is a zipped set of XML files.
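To make the "zipped set of XML files" concrete, here is a minimal Python sketch that reads `word/document.xml` out of the archive and collects paragraph text. It is not how docx2txt itself is implemented; the helper name `docx_paragraphs` is invented for this example, and the tiny in-memory archive is a stand-in for a real .docx that contains only the one part the sketch needs.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_paragraphs(docx_file):
    """Return one string per <w:p>; a <w:br/> becomes '\n' inside it."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        parts = []
        for node in p.iter():
            if node.tag == f"{{{W}}}t":      # run text
                parts.append(node.text or "")
            elif node.tag == f"{{{W}}}br":   # explicit (hard) break
                parts.append("\n")
        paragraphs.append("".join(parts))
    return paragraphs

# Stand-in archive so the sketch runs without a real Word file.
document_xml = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>A long paragraph that Word would wrap on screen.</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>before</w:t><w:br/><w:t>after</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("word/document.xml", document_xml)
buffer.seek(0)

paragraphs = docx_paragraphs(buffer)
print(paragraphs)
```

Note that the first paragraph comes back as a single string: nothing in the XML says where Word would wrap it.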

With regard to line wrapping, the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size and page width. docx2txt normally just tries to fit text in 80 columns (the width is configurable), without any regard for font and font size. If your .docx contains font information from a Windows system that is not available on Unix/Linux, then doing the export to .txt via Open/LibreOffice would also be unlikely to result in the same layout, although it tries to do a good job¹.
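A small Python sketch of why column-based wrapping cannot reproduce Word's numbering: the line on which a word lands depends entirely on the chosen width. This uses Python's `textwrap` as a stand-in for fixed-width wrapping in general (it is not docx2txt's code), and the paragraph and the helper `line_of` are made up for the example.

```python
import textwrap

# Forty short tokens, w01 .. w40, forming one "paragraph".
paragraph = " ".join(f"w{i:02d}" for i in range(1, 41))

def line_of(word, width):
    """Return the 1-based line on which `word` lands at a given width."""
    for number, line in enumerate(textwrap.wrap(paragraph, width=width), 1):
        if word in line.split():
            return number
    return None

# The same word ends up on different lines depending on the column width.
at_80 = line_of("w18", 80)
at_60 = line_of("w18", 60)
print(at_80, at_60)
```

Word's soft breaks depend on glyph widths in the actual font rather than on a character count, so no single column width can be expected to match them across a whole document.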

So docx2txt, or any other command-line utility, including command-line driven Open/LibreOffice processing, is not guaranteed to convert the text to the same layout as exporting from Word does².

If you want (or are forced by client requirements) to render exactly as Word does, there is in my experience only one way: let Word do the rendering. When faced with a problem similar to yours³, and having incompatible results using other tools, including OpenOffice, I resorted to installing a Windows VM on the host Linux server. A program on the client VM watches for incoming files to be converted on the host, then starts and drives Word to do the conversion and copies back the result⁴.
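The watch-and-convert part of that setup could look roughly like the Python sketch below. It is not the program described above: the names `process_incoming` and `stub_convert` and the folder layout are hypothetical, and the converter is a placeholder where a real one would drive Word.

```python
import tempfile
from pathlib import Path

def process_incoming(incoming, outgoing, convert):
    """Convert each .docx in `incoming` once; results go to `outgoing`."""
    done = []
    for src in sorted(Path(incoming).glob("*.docx")):
        dst = Path(outgoing) / (src.stem + ".txt")
        if dst.exists():
            continue  # already converted on an earlier pass
        convert(src, dst)
        done.append(dst.name)
    return done

def stub_convert(src, dst):
    # Placeholder only: a real converter would ask Word to save `src`
    # as plain text with line breaks inserted.
    dst.write_bytes(f"converted from {src.name}\r\n".encode("utf-8"))

# Demo in a throwaway directory; a real watcher would call
# process_incoming() in a loop with a sleep between passes.
with tempfile.TemporaryDirectory() as tmp:
    incoming, outgoing = Path(tmp, "incoming"), Path(tmp, "outgoing")
    incoming.mkdir()
    outgoing.mkdir()
    (incoming / "report.docx").write_bytes(b"")
    first_pass = process_incoming(incoming, outgoing, stub_convert)
    second_pass = process_incoming(incoming, outgoing, stub_convert)

print(first_pass, second_pass)
```

Keeping the converter as a parameter means the polling logic can be tested on Linux while the Word-specific part stays on the Windows side.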

Decisions about using CR/LF or LF only, and UTF-8 or some other encoding, for the .txt largely depend on how the resulting files are used. If the resulting files are used on Windows I would definitely go with CR/LF, UTF-8 and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8 without a BOM, but will not barf on the BOM and may use that information. You should test all your target applications for compatibility if those are known up front.
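As a small illustration of what that combination means at the byte level, here is a Python sketch that writes accented text as UTF-8 with a BOM and CR/LF line ends, then reads it back with and without BOM handling. The function name `encode_for_windows` and the sample words are invented for the example.

```python
import codecs

def encode_for_windows(lines):
    """Join lines with CR/LF and encode as UTF-8 with a leading BOM."""
    text = "".join(line + "\r\n" for line in lines)
    return text.encode("utf-8-sig")  # "utf-8-sig" prepends the BOM

data = encode_for_windows(["café", "naïve"])

has_bom = data.startswith(codecs.BOM_UTF8)

# A BOM-aware reader strips the marker ...
clean = data.decode("utf-8-sig")
# ... while a plain UTF-8 reader keeps it as U+FEFF on the first line.
raw = data.decode("utf-8")

print(has_bom, repr(clean), repr(raw[0]))
```

A tool that decodes as plain UTF-8 will see the extra U+FEFF character at the start of line 1, which is the kind of thing worth checking in each target application.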

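¹ This incompatibility is the primary reason some of my friends cannot switch from Windows to Linux, even though they would like to. They have to use Microsoft Word, because every once in a while Open/LibreOffice mangles the texts they exchange with clients.
² You could install all the fonts used in the Word file and might, at some point, get lucky with some texts.
³ .doc/.docx
⁴ That program uses GUI automation, as if someone were clicking its menus, and does not try to drive Word via an API. I am pretty sure the latter can be done as well, and it would have the advantage of not breaking things if Word gets upgraded.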