NHT 2 java docx apache-poi docx4j
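I want to read the data of a Word (*.docx) file and save it to my database, so that when needed I can fetch the data from the database and display it on my HTML page.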
Word *.docx files are ZIP archives containing XML files in the Office Open XML format. Formulas contained in Word *.docx documents are stored as Office MathML (OMML).
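To make the ZIP structure concrete, here is a minimal sketch that lists the entries of such an archive using only the JDK. The class name `DocxZipPeek` and the helper `fakeDocx` are made up for this illustration; the helper builds a tiny in-memory archive that only mimics the layout of a real *.docx, so the example runs without a Word file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxZipPeek {

    // Lists the names of all entries in a ZIP stream (a *.docx is such a stream).
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Hypothetical stand-in for a real file: a two-entry ZIP shaped like a *.docx.
    static byte[] fakeDocx() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // With a real file, pass new java.io.FileInputStream("Formula.docx") instead.
        for (String name : listEntries(new ByteArrayInputStream(fakeDocx()))) {
            System.out.println(name);
        }
    }
}
```

In a real document the main body, including any OMML formulas, lives in the entry `word/document.xml`.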
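Unfortunately, this XML format is not well known outside Microsoft Office, so it cannot be used directly in HTML, for example. Fortunately, it is XML, and as such it can be transformed using XSLT (see Transforming XML Data with XSLT). So we can transform the OMML into MathML, for example, which is usable in a much wider range of use cases.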
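As a rough illustration of what an XSLT transformation looks like from Java, the following sketch applies a toy stylesheet with the JDK's `javax.xml.transform` API. The stylesheet and the element names `formula` and `var` are invented for this example and have nothing to do with real OMML; the point is only the mechanics of stylesheet plus source giving a result.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    // Toy stylesheet (an assumption for this demo): turns every <var> child
    // of <formula> into an <mi> element inside a single <math> root.
    static final String XSL =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"/formula\">"
        + "<math><xsl:apply-templates select=\"var\"/></math>"
        + "</xsl:template>"
        + "<xsl:template match=\"var\">"
        + "<mi><xsl:value-of select=\".\"/></mi>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML string and returns the result as a string.
    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<formula><var>x</var><var>y</var></formula>"));
    }
}
```

The real OMML to MathML conversion works the same way, only with a much larger stylesheet and a DOM node as the source.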
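An XSLT transformation is driven mainly by an XSL definition of the transformation (a stylesheet). Unfortunately, writing one is not easy either. Fortunately, Microsoft has already done this: if you have a current Microsoft Office installation, you can find the file OMML2MML.XSL in the Microsoft Office program directory under %ProgramFiles%\. If you cannot find it there, search the web for it.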
Knowing all this, we can get the OMML from the XWPFDocument, transform it into MathML, and then save that for later use.
My example stores the formulas it finds as MathML in an ArrayList of strings. You should be able to store these strings in your database as well.
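For the database part, a minimal JDBC sketch might look like the following. The table `formula` and its columns `doc_name`, `position` and `mathml` are hypothetical, as is the class name `MathMLStore`; adapt them to your own schema and supply a `Connection` from your own JDBC driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class MathMLStore {

    // Hypothetical table: formula(doc_name VARCHAR, position INT, mathml TEXT or CLOB)
    static final String INSERT_SQL =
            "INSERT INTO formula (doc_name, position, mathml) VALUES (?, ?, ?)";

    // Inserts one row per MathML string, keeping the order in which the formulas
    // were found. Returns the number of statements that were batched and executed.
    static int saveAll(Connection conn, String docName, List<String> mathMLList)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int position = 1;
            for (String mathML : mathMLList) {
                ps.setString(1, docName);
                ps.setInt(2, position++);
                ps.setString(3, mathML);
                ps.addBatch();
            }
            return ps.executeBatch().length;
        }
    }
}
```

Using a `PreparedStatement` with parameters avoids having to escape the quotes and angle brackets in the MathML. Reading the rows back ordered by `position` gives the strings to write into the HTML page.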
The example needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025. This is because it uses CTOMath, which is not included in the smaller poi-ooxml-schemas jar.
Word document:
Java code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
import java.util.List;
import java.util.ArrayList;

/*
 needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/
public class WordReadFormulas {

    static File stylesheet = new File("OMML2MML.XSL");
    static TransformerFactory tFactory = TransformerFactory.newInstance();
    static StreamSource stylesource = new StreamSource(stylesheet);

    static String getMathML(CTOMath ctomath) throws Exception {
        Transformer transformer = tFactory.newTransformer(stylesource);
        Node node = ctomath.getDomNode();
        DOMSource source = new DOMSource(node);
        StringWriter stringwriter = new StringWriter();
        StreamResult result = new StreamResult(stringwriter);
        transformer.setOutputProperty("omit-xml-declaration", "yes");
        transformer.transform(source, result);
        String mathML = stringwriter.toString();
        stringwriter.close();

        // The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
        // We don't need these since we want to use the MathML in HTML, not in XML.
        // Ideally we would change OMML2MML.XSL so that it does not emit them.
        // To keep this example as simple as possible, we use string replacement
        // to get rid of the XML specifics.
        mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
        mathML = mathML.replaceAll("xmlns:mml", "xmlns");
        mathML = mathML.replaceAll("mml:", "");
        return mathML;
    }

    public static void main(String[] args) throws Exception {
        XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

        // storing the found MathML in an ArrayList of strings
        List<String> mathMLList = new ArrayList<String>();

        // getting the formulas out of all body elements
        for (IBodyElement ibodyelement : document.getBodyElements()) {
            if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
                XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
                // inline formulas
                for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
                    mathMLList.add(getMathML(ctomath));
                }
                // formulas in math paragraphs
                for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
                    for (CTOMath ctomath : ctomathpara.getOMathList()) {
                        mathMLList.add(getMathML(ctomath));
                    }
                }
            } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
                // formulas in the paragraphs of table cells
                XWPFTable table = (XWPFTable)ibodyelement;
                for (XWPFTableRow row : table.getRows()) {
                    for (XWPFTableCell cell : row.getTableCells()) {
                        for (XWPFParagraph paragraph : cell.getParagraphs()) {
                            for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
                                mathMLList.add(getMathML(ctomath));
                            }
                            for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
                                for (CTOMath ctomath : ctomathpara.getOMathList()) {
                                    mathMLList.add(getMathML(ctomath));
                                }
                            }
                        }
                    }
                }
            }
        }

        document.close();

        // creating a sample HTML file
        String encoding = "UTF-8";
        FileOutputStream fos = new FileOutputStream("result.html");
        OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
        writer.write("<!DOCTYPE html>\n");
        writer.write("<html lang=\"en\">");
        writer.write("<head>");
        writer.write("<meta charset=\"utf-8\"/>");

        // using MathJax to help all browsers interpret MathML
        writer.write("<script type=\"text/javascript\"");
        writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
        writer.write(">");
        writer.write("</script>");

        writer.write("</head>");
        writer.write("<body>");
        writer.write("<p>The following formulas were found in the Word document: </p>");

        int i = 1;
        for (String mathML : mathMLList) {
            writer.write("<p>Formula " + i++ + ":</p>");
            writer.write(mathML);
            writer.write("<p/>");
        }

        writer.write("</body>");
        writer.write("</html>");
        writer.close();

        Desktop.getDesktop().browse(new File("result.html").toURI());
    }
}
Result:
I just tested this code with apache poi 5.0.0 and it works. For apache poi 5.0.0 you need poi-ooxml-full-5.0.0.jar. Please read https://poi.apache.org/help/faq.html#faq-N10025 to find out which ooxml libraries are needed for which apache poi version.
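To show what the namespace cleanup in getMathML does to a string, here is a small standalone sketch. The class name `MathMLCleanup` is made up, and the sample input is hand-written to resemble prefixed MathML; it is not actual output captured from OMML2MML.XSL. The sketch uses `String.replace` (literal matching) rather than `replaceAll`, since none of the patterns need to be regular expressions.

```java
public class MathMLCleanup {

    // Strips the OMML namespace declaration and the "mml" prefix so that the
    // markup can be embedded directly in an HTML page.
    static String stripNamespaces(String mathML) {
        return mathML
                .replace("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "")
                .replace("xmlns:mml", "xmlns")
                .replace("mml:", "");
    }

    public static void main(String[] args) {
        // Hand-written sample input (an assumption for this demo).
        String sample =
              "<mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\""
            + " xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\">"
            + "<mml:mi>x</mml:mi></mml:math>";
        System.out.println(stripNamespaces(sample));
    }
}
```

String replacement is a shortcut: it would also alter any formula text that happens to contain `mml:`. Editing the stylesheet so that it emits unprefixed MathML would be the more robust route.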