Extracting paragraphs from a Word document using Apache POI

Question

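I have a Word document (a .docx file).

As you can see in the Word document, it contains a number of questions written as bullet points. I am trying to extract each paragraph from the file using Apache POI. This is my current code: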

    public static String readDocxFile(String fileName) {
        try {
            File file = new File(fileName);
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            XWPFDocument document = new XWPFDocument(fis);

            List<XWPFParagraph> paragraphs = document.getParagraphs();
            String whole = "";
            for (XWPFParagraph para : paragraphs) {
                System.out.println(para.getText());
                whole += "\n" + para.getText();
            }
            fis.close();
            document.close();
            return whole;
        } catch (Exception e) {
            e.printStackTrace();
            return "";
        }
    }

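The problem with this approach is that it prints each line rather than each paragraph. The bullet points also disappear from the extracted `whole` string, which comes back as plain text.

Can anyone explain what I am doing wrong? If you have a better way to solve this, please suggest it.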

Answer 1

rit*_*984 1

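The code above is correct. I ran it on my system and it returned each paragraph. I think the issue lies in how the content was written in the docx file: whenever I type inside a bullet point and press the Enter key, the current bullet point is split, and the code above treats the text after that break as a separate paragraph.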
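As a rough illustration of why each bullet comes back as its own paragraph, and why the bullet marker itself is missing from the text, it may help to look at how a .docx stores list items. Each press of Enter creates a new `w:p` element, and a bullet is not a character in the text but a numbering property (`w:numPr`) attached to the paragraph. The sketch below is not Apache POI: it uses only the JDK XML parser on a hand-written fragment that imitates `word/document.xml`. The class name `DocxParagraphSketch`, the sample text, and the `"* "` prefix are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocxParagraphSketch {
    // WordprocessingML main namespace
    public static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written stand-in for word/document.xml (illustrative only):
    // one plain paragraph followed by two list-item paragraphs.
    public static final String SAMPLE =
        "<w:document xmlns:w=\"" + W + "\"><w:body>"
        + "<w:p><w:r><w:t>Questions</w:t></w:r></w:p>"
        + "<w:p><w:pPr><w:numPr><w:ilvl w:val=\"0\"/><w:numId w:val=\"1\"/></w:numPr></w:pPr>"
        + "<w:r><w:t>What is POI?</w:t></w:r></w:p>"
        + "<w:p><w:pPr><w:numPr><w:ilvl w:val=\"0\"/><w:numId w:val=\"1\"/></w:numPr></w:pPr>"
        + "<w:r><w:t>How do I read paragraphs?</w:t></w:r></w:p>"
        + "</w:body></w:document>";

    // Returns one string per w:p element, prefixing "* " when the paragraph
    // carries a w:numPr element, i.e. when it is a list item.
    public static List<String> paragraphs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            List<String> result = new ArrayList<>();
            NodeList paras = doc.getElementsByTagNameNS(W, "p");
            for (int i = 0; i < paras.getLength(); i++) {
                Element p = (Element) paras.item(i);
                boolean isListItem = p.getElementsByTagNameNS(W, "numPr").getLength() > 0;

                // Concatenate the text of every w:t run inside this paragraph
                StringBuilder text = new StringBuilder();
                NodeList runs = p.getElementsByTagNameNS(W, "t");
                for (int j = 0; j < runs.getLength(); j++) {
                    text.append(runs.item(j).getTextContent());
                }
                result.add((isListItem ? "* " : "") + text);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        paragraphs(SAMPLE).forEach(System.out::println);
    }
}
```

With the sample fragment this prints `Questions`, `* What is POI?` and `* How do I read paragraphs?` on separate lines. In Apache POI the same distinction should be available through `XWPFParagraph.getNumID()`, which, in the versions I am aware of, returns `null` for paragraphs that are not list items. Checking it inside the loop would let you add a marker of your own before `para.getText()`.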

The code sample I wrote below may be useful to you. Note that I use a Set data structure to ignore duplicate questions in the docx.

The Apache POI dependency is as follows:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.7</version>
</dependency>

Code sample:

package com;

import java.io.File;
import java.io.FileInputStream;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class App {

    public static void main(String... args) throws Exception {
        Set<String> bulletPoints = fileExtractor();
        bulletPoints.forEach(System.out::println);
    }

    // Reads every paragraph of the document and returns the distinct, non-empty texts.
    public static Set<String> fileExtractor() throws Exception {
        File file = new File("/home/deskuser/Documents/query.docx");
        // try-with-resources closes the stream even if reading fails
        try (FileInputStream fis = new FileInputStream(file)) {
            Set<String> bulletPoints = new HashSet<>();
            XWPFDocument document = new XWPFDocument(fis);

            List<XWPFParagraph> paragraphs = document.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                String text = para.getText();
                System.out.println(text);
                // Plain null/empty check, so no Spring dependency is needed;
                // the Set silently drops duplicate paragraphs
                if (text != null && !text.isEmpty()) {
                    bulletPoints.add(text);
                }
            }
            return bulletPoints;
        } catch (Exception e) {
            e.printStackTrace();
            throw new Exception("error while extracting file.", e);
        }
    }
}

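One caveat about the Set in the sample above: a `HashSet` drops duplicates but makes no promise about iteration order, so the questions may print in a different order from the document. If order matters, a `LinkedHashSet` is one possible replacement. The sketch below shows the idea on plain strings, so it runs without POI. The class name `OrderedDedupSketch`, the helper `uniqueInOrder`, and the sample values are hypothetical, and in real use the list would hold the `para.getText()` results.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class OrderedDedupSketch {

    // Drops null/blank entries and duplicates while keeping first-seen order.
    public static Set<String> uniqueInOrder(List<String> paragraphTexts) {
        Set<String> unique = new LinkedHashSet<>();
        for (String text : paragraphTexts) {
            if (text != null && !text.trim().isEmpty()) {
                unique.add(text);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Stand-in for the texts returned by para.getText()
        List<String> texts = Arrays.asList("Q1", "", "Q2", "Q1", "Q3");
        System.out.println(uniqueInOrder(texts)); // prints [Q1, Q2, Q3]
    }
}
```

Unlike the `isEmpty` check in the sample above, this version also skips whitespace-only paragraphs; remove the `trim()` call if those should be kept.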
