Processing a docx file with Apache POI

Luk*_*eva 2 java apache file docx apache-poi

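I'm trying to retrieve a docx file from a database and process it by inspecting its contents. I think my code retrieves the file I want, but it seems I haven't fully understood Apache POI. I get the error shown in the stack trace below. Is my understanding of POI wrong?

This is how I load the file: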

public void loadFile(String FileName)
{
    InputStream is = null;
    try
    {
        //Connecting to MYSQL Database
        Class.forName(driver).newInstance();
        con = DriverManager.getConnection(url+dbName,userName,password);

        Statement stmt = (Statement) con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT FILE FROM doccompfiles WHERE FileName = '"+ FileName +"'");

        while(rs.next())
        {
            is = rs.getBinaryStream("FILE");
        }

        HWPFDocument doc = new HWPFDocument(is);
        WordExtractor we = new WordExtractor(doc);

        String[] paragraphs = we.getParagraphText();
        JOptionPane.showMessageDialog(null, "Number of Paragraphs" + paragraphs.length);
        con.close();
    }
    catch(Exception ex)
    {
        ex.printStackTrace();
    }
}

Stack trace:

org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:131)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:138)
at org.apache.poi.hwpf.HWPFDocumentCore.verifyAndBuildPOIFS(HWPFDocumentCore.java:106)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
at documentComparisor.Database.loadFile(Database.java:156)
at documentComparisor.Home$5.actionPerformed(Home.java:195)
at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
at javax.swing.plaf.basic.BasicButtonListener.mouseReleased(Unknown Source)
at java.awt.Component.processMouseEvent(Unknown Source)
at javax.swing.JComponent.processMouseEvent(Unknown Source)
at java.awt.Component.processEvent(Unknown Source)
at java.awt.Container.processEvent(Unknown Source)
at java.awt.Component.dispatchEventImpl(Unknown Source)
at java.awt.Container.dispatchEventImpl(Unknown Source)
at java.awt.Component.dispatchEvent(Unknown Source)
at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
at java.awt.Container.dispatchEventImpl(Unknown Source)
at java.awt.Window.dispatchEventImpl(Unknown Source)
at java.awt.Component.dispatchEvent(Unknown Source)
at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
at java.awt.EventQueue.access$000(Unknown Source)
at java.awt.EventQueue$3.run(Unknown Source)
at java.awt.EventQueue$3.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
at java.awt.EventQueue$4.run(Unknown Source)
at java.awt.EventQueue$4.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
at java.awt.EventQueue.dispatchEvent(Unknown Source)
at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
at java.awt.EventDispatchThread.run(Unknown Source)

Art*_*nov 5

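You should know that MS Office documents currently exist in two different formats: the old format used by versions of MS Office before 2007 (e.g. ".doc" or ".xls"), and the XML-based format used by newer versions (e.g. ".docx" or ".xlsx").

Apache POI has different parts for handling the different formats. The names of the key classes for handling old MS Office format files usually start with "H", and the names of the key classes for handling XML-based format files start with "X".

So, in your example, to handle the new format you should use XWPFDocument instead of HWPFDocument: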
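If it is not known in advance which of the two formats a stored file uses, one option is to look at the first bytes of the stream before choosing an "H" or an "X" class. The sketch below uses only the JDK and relies on two well-known signatures: OLE2 compound files begin with D0 CF 11 E0 A1 B1 1A E1, and ZIP archives (the container used by .docx and .xlsx) begin with "PK" 03 04. The names OfficeFormatSniffer, Format and sniff are made up for illustration and are not part of POI. A ZIP header only suggests an Office XML file, since any ZIP archive matches.

```java
import java.io.IOException;
import java.io.InputStream;

class OfficeFormatSniffer {
    enum Format { OLE2, ZIP_BASED, UNKNOWN }

    // Signature of an OLE2 compound file (.doc, .xls).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of a ZIP archive (.docx, .xlsx are ZIP containers).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Peeks at the first 8 bytes and rewinds, so the stream can still be parsed afterwards. */
    static Format sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset, e.g. wrap it in BufferedInputStream");
        }
        in.mark(8);
        byte[] head = new byte[8];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 8 bytes or end of stream.
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        in.reset();
        if (startsWith(head, n, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(head, n, ZIP_MAGIC)) return Format.ZIP_BASED;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] magic) {
        if (length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A stream returned by a JDBC driver may not support mark/reset, so it would typically be wrapped in a java.io.BufferedInputStream before being passed to sniff.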

XWPFDocument doc = new XWPFDocument(is);
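Swapping the document class alone will probably not compile in the question's code, because WordExtractor belongs to the HWPF side and expects an HWPFDocument. A minimal sketch of counting paragraphs on the XWPF side follows, assuming a reasonably recent POI release with the poi-ooxml module on the classpath. DocxParagraphs and countParagraphs are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

class DocxParagraphs {
    /** Returns the number of top-level body paragraphs in a .docx stream. */
    static int countParagraphs(InputStream in) throws IOException {
        // XWPFDocument parses the Office XML package; try-with-resources closes it afterwards.
        try (XWPFDocument doc = new XWPFDocument(in)) {
            // getParagraphs() lists body paragraphs; text inside tables is not included here.
            List<XWPFParagraph> paragraphs = doc.getParagraphs();
            return paragraphs.size();
        }
    }
}
```

In the question's loadFile method, the stream obtained from rs.getBinaryStream("FILE") could be passed to countParagraphs in place of the HWPFDocument and WordExtractor lines. The text of each paragraph is available through XWPFParagraph.getText().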