How do I get the line number and page number of a specific word in a doc or docx file using Apache POI?

Question

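I am trying to create a Java application that searches a selected doc or docx file for a specific word and generates a report. The report should contain the page number and line number of each occurrence of the word. So far I can read doc and docx files paragraph by paragraph, but I have not found any way to search for a specific word and get the line and page number where it occurs. I have searched a lot, with no luck so far. I hope someone knows how to do this.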

Here is my code:

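// Imports this fragment needs:
import java.io.File;
import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

// fc is presumably the JFileChooser; the enclosing method must handle IOException.
File file = fc.getSelectedFile();
// Check the file extension rather than the whole path:
// contains("docx") would also match a .doc file inside a folder named "docx".
if (file.getName().toLowerCase().endsWith(".docx")) {
    // try-with-resources closes the stream even if reading fails
    try (FileInputStream fis = new FileInputStream(file)) {
        XWPFDocument document = new XWPFDocument(fis);
        List<XWPFParagraph> paragraphs = document.getParagraphs();
        System.out.println("Total no of paragraph " + paragraphs.size());
        for (XWPFParagraph para : paragraphs) {
            System.out.println(para.getText());
        }
    }
} else {
    try (FileInputStream fis = new FileInputStream(file)) {
        HWPFDocument document = new HWPFDocument(fis);
        WordExtractor extractor = new WordExtractor(document);
        // Print every non-null paragraph of the legacy .doc file
        String[] fileData = extractor.getParagraphText();
        for (String paragraph : fileData) {
            if (paragraph != null) {
                System.out.println(paragraph);
            }
        }
        extractor.close();
    }
}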

I am using Swing and Apache POI 3.10.1.

Answer 1

Sta*_*avL 5

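I'm afraid there is no easy way to do this. Line and page numbers are not stored in the file; they are calculated on the fly from the text layout, which depends on the specified page size. The page size determines where the text wraps.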
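To illustrate why the file cannot simply store a line number, here is a minimal sketch in plain Java (no POI involved). It uses a simplified greedy wrap that measures width in characters, which is an assumption made only for illustration; a real word processor measures glyph widths against the page size. The class and method names (`LineNumberDependsOnWidth`, `wrap`, `lineOf`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class LineNumberDependsOnWidth {

    // Greedy word wrap: packs words into lines of at most maxChars characters.
    // Simplification: width is counted in characters, not rendered glyph widths.
    static List<String> wrap(String text, int maxChars) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // Start a new line if adding this word (plus a space) would overflow
            if (line.length() > 0 && line.length() + 1 + word.length() > maxChars) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    // Returns the 1-based number of the first wrapped line containing the word, or -1.
    static int lineOf(String text, String word, int maxChars) {
        List<String> lines = wrap(text, maxChars);
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(word)) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(lineOf(text, "lazy", 20)); // line number on a wide "page"
        System.out.println(lineOf(text, "lazy", 12)); // line number on a narrow "page"
    }
}
```

The same word in the same text lands on a different line as soon as the available width changes, which is why the number has to come from a layout pass rather than from the document file.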

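You could try to implement this by loading the document into a JEditorPane with an appropriate EditorKit. See, for example, the DocxEditorKit attempt at http://java-sl.com/docx_editor_kit.html. It provides basic functionality, and you could try to implement your own EditorKit based on its source code and ideas.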

The kit should support pagination in order to render pages (see the articles about pagination at http://java-sl.com/articles.html).

Once pagination is done, you can find the word's position (its caret offset) and get the row and column (see http://java-sl.com/tip_row_column.html).
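As a rough sketch of the offset-to-position step, the following uses only the Swing text model (`javax.swing.text.PlainDocument`) to turn a word's offset into a line and column. This is a simplification and not the code from the linked tip: it counts logical, newline-delimited lines in the model, not wrapped rows on a paginated page. For rows as actually rendered, a view-based call such as `javax.swing.text.Utilities.getRowStart` on a laid-out text component would be needed. The class and method names (`WordLocator`, `locate`) are hypothetical.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.PlainDocument;

class WordLocator {

    // Returns {line, column} (both 1-based) of the first occurrence of word,
    // or null if the word does not occur in the document.
    static int[] locate(PlainDocument doc, String word) throws BadLocationException {
        String text = doc.getText(0, doc.getLength());
        int offset = text.indexOf(word); // the "caret offset" of the word
        if (offset < 0) {
            return null;
        }
        // PlainDocument keeps one child element per newline-delimited line
        Element root = doc.getDefaultRootElement();
        int lineIndex = root.getElementIndex(offset);
        int lineStart = root.getElement(lineIndex).getStartOffset();
        return new int[] { lineIndex + 1, offset - lineStart + 1 };
    }

    public static void main(String[] args) throws BadLocationException {
        PlainDocument doc = new PlainDocument();
        doc.insertString(0, "first line\nsecond line with target\nthird line", null);
        int[] pos = locate(doc, "target");
        System.out.println("line " + pos[0] + ", column " + pos[1]);
    }
}
```

The same offset lookup would apply to the document of a paginated JEditorPane; only the mapping from offset to rendered row and page would have to come from the view rather than from the model's elements.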
