Read equations and formulas from Word (Docx) into HTML and save them to a database using Java

NHT*_*NHT 2 java docx apache-poi docx4j

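I have a Word (.docx) file that contains equations, as in the image below (image not included).

I want to read the data from the Word (.docx) file and save it to my database, so that when needed I can fetch the data from the database and display it on my HTML page.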

Axe*_*ter 7

Word *.docx files are ZIP archives containing XML files in Office Open XML format. The formulas contained in Word *.docx documents are Office MathML (OMML).
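To make the "ZIP archive of XML files" point concrete, here is a minimal sketch that lists the parts of such an archive with the JDK's `java.util.zip` classes only. The class name `DocxParts` and the helper `sampleArchive()` are made up for illustration; the helper builds a tiny stand-in archive in memory so the sketch runs without a real document. In a real *.docx the main body lives in `word/document.xml`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxParts {

    // Returns the names of all entries of a ZIP stream, in archive order.
    public static List<String> listParts(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    // Hypothetical stand-in for a real .docx: a ZIP with two dummy XML parts.
    public static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"[Content_Types].xml", "word/document.xml"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // For a real file, pass new FileInputStream("Formula.docx") instead.
        System.out.println(listParts(new ByteArrayInputStream(sampleArchive())));
    }
}
```

This is only meant for looking inside the package; the example further down uses apache poi rather than reading the ZIP by hand.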

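Unfortunately, this XML format is not well known outside Microsoft Office, so it cannot be used directly in HTML, for example. Fortunately, it is XML, so it can be transformed as described in Transforming XML Data with XSLT. So we can transform the OMML into MathML, for example, which is usable in a much wider range of use cases.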

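A transformation via XSLT is mainly driven by an XSL definition of the transformation. Unfortunately, creating such a definition is not easy either. Fortunately, Microsoft has already done this: if you have a current Microsoft Office installation, you can find the file OMML2MML.XSL in the Microsoft Office program directory under %ProgramFiles%\. If you cannot find it there, search the web to get it.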
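For readers who have not used XSLT from Java before, here is a small sketch of the mechanics using only `javax.xml.transform` from the JDK. The stylesheet `TOY_XSL` and the `<frac>` input element are invented for this illustration and have nothing to do with the real OMML2MML.XSL; they only show how a stylesheet plus an input document yields an output string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    // Toy stylesheet (an assumption for this sketch): turns <frac><num/><den/></frac> into an <mfrac>.
    public static final String TOY_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/frac'>"
      + "<mfrac><mn><xsl:value-of select='num'/></mn><mn><xsl:value-of select='den'/></mn></mfrac>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet text to the XML text and returns the result as a string.
    public static String transform(String xsl, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException ex) {
            throw new IllegalStateException("XSLT transformation failed", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(TOY_XSL, "<frac><num>1</num><den>2</den></frac>"));
    }
}
```

With the real stylesheet you would build the `StreamSource` from the OMML2MML.XSL file and feed in the OMML node instead of a string.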

So, knowing all this, we can get the OMML from the XWPFDocument, transform it into MathML, and then save that for later use.

My example stores the formulas it finds as MathML in an ArrayList of strings. You should be able to store these strings in your database as well.
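As for the database part, a rough sketch with plain JDBC could look like the following. The table `formula` and its columns `position` and `mathml` are assumptions made up for this illustration, as is the class name `FormulaStore`; adapt them to your own schema and use whatever JDBC driver your database needs.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class FormulaStore {

    // Hypothetical table: formula(position INT, mathml TEXT-like column).
    public static final String INSERT_SQL =
        "INSERT INTO formula (position, mathml) VALUES (?, ?)";

    // Inserts every MathML string as one row and returns the number of batched statements.
    // SQLException is wrapped only to keep the sketch free of checked-exception plumbing.
    public static int saveFormulas(Connection conn, List<String> mathMLList) {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int position = 1;
            for (String mathML : mathMLList) {
                ps.setInt(1, position++);   // order of the formula within the document
                ps.setString(2, mathML);    // the MathML markup itself
                ps.addBatch();
            }
            return ps.executeBatch().length;
        } catch (SQLException ex) {
            throw new IllegalStateException("could not store formulas", ex);
        }
    }
}
```

You would call it as `FormulaStore.saveFormulas(connection, mathMLList)` with a `Connection` obtained from your driver. Reading the rows back and writing each `mathml` value into the page then works like the HTML output in the example.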

The example needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025. This is because it uses CTOMath, which is not shipped with the smaller poi-ooxml-schemas jar.

Word document (image not included):

Java code:

import java.io.*;
import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;

import org.w3c.dom.Node;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import java.awt.Desktop;

import java.util.List;
import java.util.ArrayList;

/*
needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/

public class WordReadFormulas {

 static File stylesheet = new File("OMML2MML.XSL");
 static TransformerFactory tFactory = TransformerFactory.newInstance();
 static StreamSource stylesource = new StreamSource(stylesheet); 

 static String getMathML(CTOMath ctomath) throws Exception {
  Transformer transformer = tFactory.newTransformer(stylesource);

  Node node = ctomath.getDomNode();

  DOMSource source = new DOMSource(node);
  StringWriter stringwriter = new StringWriter();
  StreamResult result = new StreamResult(stringwriter);
  transformer.setOutputProperty("omit-xml-declaration", "yes");
  transformer.transform(source, result);

  String mathML = stringwriter.toString();
  stringwriter.close();

  //The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
  //We don't need these since we want to use the MathML in HTML, not in XML.
  //Ideally we would change OMML2MML.XSL so that it does not emit them.
  //But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
  //String.replace is used instead of replaceAll because these are literal strings, not regular expressions.
  mathML = mathML.replace("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
  mathML = mathML.replace("xmlns:mml", "xmlns");
  mathML = mathML.replace("mml:", "");

  return mathML;
 }

 public static void main(String[] args) throws Exception {
    
  XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

  //storing the found MathML in an ArrayList of strings
  List<String> mathMLList = new ArrayList<String>();

  //getting the formulas out of all body elements
  for (IBodyElement ibodyelement : document.getBodyElements()) {
   if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
    XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
    for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
     mathMLList.add(getMathML(ctomath));
    }
    for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
     for (CTOMath ctomath : ctomathpara.getOMathList()) {
      mathMLList.add(getMathML(ctomath));
     }
    }
   } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
    XWPFTable table = (XWPFTable)ibodyelement; 
    for (XWPFTableRow row : table.getRows()) {
     for (XWPFTableCell cell : row.getTableCells()) {
      for (XWPFParagraph paragraph : cell.getParagraphs()) {
       for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
        mathMLList.add(getMathML(ctomath));
       }
       for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
        for (CTOMath ctomath : ctomathpara.getOMathList()) {
         mathMLList.add(getMathML(ctomath));
        }
       }
      }
     }
    }
   }
  }

  document.close();

  //creating a sample HTML file 
  String encoding = "UTF-8";
  FileOutputStream fos = new FileOutputStream("result.html");
  OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
  writer.write("<!DOCTYPE html>\n");
  writer.write("<html lang=\"en\">");
  writer.write("<head>");
  writer.write("<meta charset=\"utf-8\"/>");

  //using MathJax for helping all browsers to interpret MathML
  writer.write("<script type=\"text/javascript\"");
  writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
  writer.write(">");
  writer.write("</script>");

  writer.write("</head>");
  writer.write("<body>");
  writer.write("<p>The following formulas were found in the Word document: </p>");

  int i = 1;
  for (String mathML : mathMLList) {
   writer.write("<p>Formula" + i++ + ":</p>");
   writer.write(mathML);
   writer.write("<p/>");
  }

  writer.write("</body>");
  writer.write("</html>");
  writer.close();

  Desktop.getDesktop().browse(new File("result.html").toURI());

 }
}

Result (image not included):


Just tested this code with apache poi 5.0.0 and it works. You need poi-ooxml-full-5.0.0.jar for apache poi 5.0.0. Please read https://poi.apache.org/help/faq.html#faq-N10025 to learn which ooxml libraries are needed for which apache poi version.