How to split the pages of a Word document into separate files in C#

Ima*_*man 0 c# ms-word

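I have an OCR program that converts images into a Word document. The Word document contains the text of all the images, and I want to split it into separate files.

Is there a way to do this in C#?

Thanks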

Zev*_*itz 5

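If Word is installed, you can use the Word object model to manipulate Word documents from C#.

First, add a reference to the Word object model: right-click the project and choose Add Reference... -> COM -> Microsoft Word 14.0 Object Library (or similar, depending on your version of Word).

Then you can use the following code: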

using Microsoft.Office.Interop.Word;
//for older versions of Word use:
//using Word;

namespace WordSplitter {
    class Program {
        static void Main(string[] args) {
            //Create a new instance of Word
            var app = new Application();

            //Show the Word instance.
            //If the code runs too slowly, you can show the application at the end of the program
            //Make sure it works properly first; otherwise, you'll get an error in a hidden window
            //(If it still runs too slowly, there are a few other ways to reduce screen updating)
            app.Visible = true;

            //We need a reference to the source document
            //It should be possible to get a reference to an open Word document, but I haven't tried it
            var doc = app.Documents.Open(@"path\to\file.doc");
            //(Can also use .docx)

            //The cast is needed when Information is typed as object rather than dynamic
            int pageCount = (int)doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];

            //We'll hold the start position of each page here
            int pageStart = 0;

            for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
                //This Range object will contain each page.
                var page = doc.Range(pageStart);

                //Generally, the end of the current page is 1 character before the start of the next.
                //However, we need to handle the last page -- since there is no next page, the 
                //GoTo method will move to the *start* of the last page.
                if (currentPageIndex < pageCount) {
                    //page.GoTo returns a new Range object, leaving the page object unaffected
                    page.End = page.GoTo(
                        What: WdGoToItem.wdGoToPage,
                        Which: WdGoToDirection.wdGoToAbsolute,
                        Count: currentPageIndex + 1
                    ).Start - 1;
                } else {
                    page.End = doc.Range().End;
                }
                pageStart = page.End + 1;

                //Copy and paste the contents of the Range into a new document
                page.Copy();
                var doc2 = app.Documents.Add();
                doc2.Range().Paste();
            }
        }
    }
}
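The comments in the code above mention showing Word only at the end and reducing screen updating. A minimal sketch of that idea follows, assuming the same interop reference. WordSession and RunHidden are hypothetical names, not part of the answer or of the Word object model. Keep in mind the answer's warning: with a hidden window, any dialog Word raises is hard to notice, so get the visible version working first.

```csharp
using System;
using Microsoft.Office.Interop.Word;

static class WordSession {
    // Hypothetical helper: runs the given work against a hidden Word instance
    // with screen updating switched off, and always quits Word afterwards,
    // even if the work throws.
    public static void RunHidden(Action<Application> work) {
        var app = new Application();
        try {
            app.Visible = false;
            app.ScreenUpdating = false;
            work(app);
        } finally {
            // Cast to _Application so the Quit method is not confused with the
            // Quit event; quit without prompting to save open documents.
            ((_Application)app).Quit(SaveChanges: WdSaveOptions.wdDoNotSaveChanges);
        }
    }
}
```

The body of Main would then move into the delegate, for example WordSession.RunHidden(app => { var doc = app.Documents.Open(@"path\to\file.doc"); /* page loop */ });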

Reference: Word Object Model Overview on MSDN


Zev*_*itz 5

This is the same as the other answer, but uses an iterator (IEnumerable<Range>) and an extension method on Document.

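using System.Collections.Generic;
using Microsoft.Office.Interop.Word;

static class PagesExtension {
    public static IEnumerable<Range> Pages(this Document doc) {
        //The cast is needed when Information is typed as object rather than dynamic
        int pageCount = (int)doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];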
        int pageStart = 0;
        for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
            var page = doc.Range(
                pageStart
            );
            if (currentPageIndex < pageCount) {
                //page.GoTo returns a new Range object, leaving the page object unaffected
                page.End = page.GoTo(
                    What: WdGoToItem.wdGoToPage,
                    Which: WdGoToDirection.wdGoToAbsolute,
                    Count: currentPageIndex+1
                ).Start-1;
            } else {
                page.End = doc.Range().End;
            }
            pageStart = page.End + 1;
            yield return page;
        }
        yield break;
    }
}

The main code then ends up as follows:

static void Main(string[] args) {
    var app = new Application();
    app.Visible = true;
    var doc = app.Documents.Open(@"path\to\source\document");
    foreach (var page in doc.Pages()) {
        page.Copy();
        var doc2 = app.Documents.Add();
        doc2.Range().Paste();
    }
}
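Both versions leave each page in a new, unsaved document, while the question asks for separate files. A minimal sketch of the missing save step follows, assuming Word 2010 or later (for SaveAs2). PageSaver, SavePage, PagePath and the page-NNN.docx naming scheme are hypothetical and not part of either answer.

```csharp
using System.IO;
using Microsoft.Office.Interop.Word;

static class PageSaver {
    // Hypothetical helper: copies one page Range into a new document,
    // saves that document as outputPath in .docx format, and closes it.
    public static void SavePage(Application app, Range page, string outputPath) {
        page.Copy();
        var pageDoc = app.Documents.Add();
        pageDoc.Range().Paste();
        pageDoc.SaveAs2(FileName: outputPath, FileFormat: WdSaveFormat.wdFormatXMLDocument);
        // Cast to _Document so the Close method is not confused with the Close event
        ((_Document)pageDoc).Close(SaveChanges: WdSaveOptions.wdDoNotSaveChanges);
    }

    // Builds a file name such as page-003.docx inside outputDirectory
    // (the naming scheme is an assumption for illustration).
    public static string PagePath(string outputDirectory, int pageNumber) {
        return Path.Combine(outputDirectory, string.Format("page-{0:D3}.docx", pageNumber));
    }
}
```

In either loop, the three copy-and-paste lines could then become a single call such as PageSaver.SavePage(app, page, PageSaver.PagePath(@"path\to\output", pageNumber)), where pageNumber is a counter you maintain (currentPageIndex in the first answer). Because the contents travel through the clipboard, anything the user had copied is overwritten while the program runs.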