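I am exporting MS Word content to plain text so that it can be used with text and file utilities. I have one constraint: the line numbering feature is enabled in the MS software, and any reference to line numbers in the final output must match that numbering. So the input, with numbered lines:
(Poe, E.A.)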
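Evidently, in Word this numbering does not break at newlines but at the right margin (or something like it). A script like docx2txt does not take that into account by default; it seems to break lines only at newlines. So if I use grep -n to number the lines, they will not match the line numbering feature of the source shown above. It is not clear from the documentation how I need to edit the Perl script so that it converts the file the way I need in this case: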
our $config_newLine = "\n"; # Alternative is "\r\n".
our $config_lineWidth = 80; # Line width, used for short line justification.
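I tried substituting \r\n for \n, but that does not seem to work for me. So I exported the document directly from Word (Save as plain text, on v.2013, 64-bit) using the following settings: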
Now, when I use these .txt files, the source numbering feature and the line numbers in the grep -n output do match perfectly.
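The check described above can be sketched in Python for readers who want to script it. This is only an illustration: grep_n is a hypothetical helper that mimics grep -n, and both sample strings are made up (one with a break at the margin, as in the Word export, and one with the break missing).

```python
import re


def grep_n(text, pattern):
    """Return (line_number, line) pairs for matching lines, numbered from 1 like grep -n."""
    regex = re.compile(pattern)
    # Note: splitlines() also splits on \r\n and a few other separators.
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


if __name__ == "__main__":
    # Hypothetical exports of the same paragraph.
    word_export = "line one\nline two has raven\nline three\n"
    other_export = "line one line two has raven\nline three\n"
    print(grep_n(word_export, "raven"))
    print(grep_n(other_export, "raven"))
```

If the two results report different line numbers for the same match, the export does not follow Word's numbering.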
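Is there a docx2txt configuration, or a similar command-line utility, that would let me convert my .docx files to plain text while preserving the line breaks, without resorting to Word as I did?
Sample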
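As suggested, I am providing a sample. In this rar archive I have bundled a .docx file containing a simple paragraph, together with the .txt file exported from it using Word and the options above. The latter can be compared with a default run of docx2txt on the source file.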
docx2txt works on the information in the docx file, which is a zipped set of XML files.
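To make the "zipped set of XML files" concrete, here is a minimal Python sketch. It is not how docx2txt is implemented. It builds a tiny in-memory archive containing only word/document.xml (an assumption for illustration, not a complete .docx) with made-up text, then reads back the paragraphs (w:p) and hard breaks (w:br), which are the only line structure stored in that XML.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical body: two paragraphs, the first containing one hard break.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>First line of paragraph one,</w:t><w:br/>"
    "<w:t>second line after a hard break.</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Paragraph two.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def build_sample_archive() -> bytes:
    """Return a zip that holds only word/document.xml (not a full .docx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML)
    return buf.getvalue()


def paragraphs(docx_bytes: bytes) -> list[str]:
    """Return one string per w:p, with w:br hard breaks turned into newlines."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    result = []
    for p in root.iter(f"{{{W}}}p"):
        parts = []
        for node in p.iter():  # document order: runs, text and breaks
            if node.tag == f"{{{W}}}t":
                parts.append(node.text or "")
            elif node.tag == f"{{{W}}}br":
                parts.append("\n")
        result.append("".join(parts))
    return result


if __name__ == "__main__":
    for number, text in enumerate(paragraphs(build_sample_archive()), start=1):
        print(number, repr(text))
```

Nothing in this data says where a rendered line ends at the right margin, so a converter can only guess.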
With regard to line wrapping, the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size and page width. docx2txt normally just tries to fit the text in 80 columns (the width is configurable), without any regard for font and font size. If your .docx contains font information from a Windows system that is not available on Unix/Linux, then doing the export to .txt via Open/LibreOffice would also be unlikely to result in the same layout, although it tries to do a good job¹.
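The effect of the wrap width on line numbers can be shown with a small Python sketch. It uses textwrap from the standard library as a stand-in for a fixed-column wrapper (an assumption: neither docx2txt nor Word wraps exactly like this), and the tokens w00 to w59 are made-up placeholder text.

```python
import textwrap

# Hypothetical paragraph: 60 distinct three-character tokens, w00 .. w59.
TEXT = " ".join(f"w{i:02d}" for i in range(60))


def line_of(token, width, text=TEXT):
    """Return the 1-based line on which token lands when text is wrapped at width."""
    for number, line in enumerate(textwrap.wrap(text, width=width), start=1):
        if token in line.split():
            return number
    return None


if __name__ == "__main__":
    for width in (80, 60):
        print(width, line_of("w45", width))
```

With these inputs the token w45 lands on line 3 at width 80 and on line 4 at width 60. A renderer that breaks on font metrics rather than character counts shifts line numbers in the same way, which is why a fixed column width cannot reproduce Word's numbering in general.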
So docx2txt, or any other command-line utility, including command-line driven Open/LibreOffice processing, is not guaranteed to convert the text to the same layout as exporting from Word does².
If you want (or are forced by client requirements) to render exactly as Word does, there is in my experience only one way: let Word do the rendering. When faced with a problem similar to yours³, and having gotten incompatible results from other tools, including OpenOffice, I resorted to installing a Windows VM on the host Linux server. On the client VM, a program watches for incoming files on the host that need converting, then starts and drives Word to do the conversion and copies the result back⁴.
Decisions about using CR/LF or LF only, and UTF-8 or some other encoding, for the .txt largely depend on how the resulting files are used. If the resulting files are used on Windows, I would definitely go with CR/LF, UTF-8 and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8 by themselves, but will not choke on the BOM and may even use that information. You should test all your target applications for compatibility if they are known up front.
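For the Windows-oriented choice (CR/LF, UTF-8, BOM), a minimal Python sketch of the writing side could look like this. write_windows_text is a hypothetical helper name; the "utf-8-sig" codec and the newline argument of open() are standard Python.

```python
import os
import tempfile


def write_windows_text(path, lines):
    """Write lines as UTF-8 with a BOM and CR/LF line endings."""
    # "utf-8-sig" emits the BOM once at the start of the file;
    # newline="\r\n" translates every "\n" written into CR/LF.
    with open(path, "w", encoding="utf-8-sig", newline="\r\n") as fh:
        for line in lines:
            fh.write(line + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "export.txt")
        write_windows_text(target, ["first line", "second line"])
        with open(target, "rb") as fh:
            print(fh.read())
```

Reading the file back in binary mode, as the demo does, is a quick way to confirm what a target application will actually receive.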
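¹ This incompatibility is the main reason some of my friends cannot switch from Windows to Linux, even though they would like to. They have to use Microsoft Word, because every once in a while Open/LibreOffice mangles the texts they exchange with their clients.
² You could install all the fonts used in the Word file, and you might then get lucky for some texts.
³ From .doc/.docx
⁴ The program uses GUI automation, as if someone were clicking through the menus, and does not try to drive Word via an API. I am pretty sure the latter could be done as well, and it would have the advantage of not breaking when Word is upgraded.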