Tags: java, equation, position, formula, apache-poi
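We are writing Java code that reads a Word document (.docx) into our program using Apache POI. We get stuck when the document contains formulas and chemical equations. We have managed to read the formulas, but we do not know how to find their index within the surrounding string.
Input (format is *.docx)
text before formulae **CHEMICAL EQUATION** text after
Output (format shall be HTML), as our code currently produces it
text before formulae text after **CHEMICAL EQUATION**
We are unable to fetch the string and reconstruct it in its original form.
Question
Is there any way to locate the position of images and formulas within the stripped line, so that they can be restored to their original place when the string is reconstructed, instead of being appended at the end of the string?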
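If the required format is HTML, then the Word text content can be read together with the Office MathML equations in the following way.
In "Reading equations & formula from Word (Docx) to html and save database using java" I provided an example which takes all Office MathML equations out of a Word document and into HTML. It uses paragraph.getCTP().getOMathList() and paragraph.getCTP().getOMathParaList() to get the OMath elements from the paragraph. This takes the OMath elements out of their text context.
If the OMath elements are needed in context with the other elements of the paragraph, then an org.apache.xmlbeans.XmlCursor is needed to loop over all the different XML elements in the paragraph. The example further down uses an XmlCursor to get the text runs together with the OMath elements from the paragraph, in document order.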
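The document-order idea can be tried without Apache POI on the classpath. The following is only a minimal sketch that swaps the XmlCursor for the JDK's DOM API: it walks a hand-written w:p fragment and emits a placeholder at the exact position of each m:oMath element. The class name ParagraphOrderSketch, the SAMPLE fragment and the [FORMULA:...] placeholder are assumptions made for illustration; they are not part of Apache POI or of the answer's code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical JDK-only illustration; not part of Apache POI.
final class ParagraphOrderSketch {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Hand-written paragraph: text run, formula, text run.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\" xmlns:m=\"" + M_NS + "\">"
      + "<w:r><w:t xml:space=\"preserve\">text before </w:t></w:r>"
      + "<m:oMath><m:r><m:t>2H2+O2=2H2O</m:t></m:r></m:oMath>"
      + "<w:r><w:t xml:space=\"preserve\"> text after</w:t></w:r>"
      + "</w:p>";

    // Parses the paragraph XML and renders it in document order.
    static String render(String paragraphXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(paragraphXml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse paragraph XML", e);
        }
    }

    // Visits child elements from first to last, so output order equals document order.
    private static void walk(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            String ns = child.getNamespaceURI();
            String name = child.getLocalName();
            if (W_NS.equals(ns) && "r".equals(name)) {
                // text run: append its text
                out.append(child.getTextContent());
            } else if (M_NS.equals(ns) && "oMath".equals(name)) {
                // formula: append a placeholder where the formula occurs
                out.append("[FORMULA:").append(child.getTextContent()).append("]");
            } else {
                // any other container (for example m:oMathPara): look inside
                walk(child, out);
            }
        }
    }

    public static void main(String[] args) {
        // prints: text before [FORMULA:2H2+O2=2H2O] text after
        System.out.println(render(SAMPLE));
    }
}
```

The sketch matches elements by namespace URI rather than by prefix, which is a design choice of this illustration; the XmlCursor code in the answer checks the prefix and local name instead.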
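The transformation from Office MathML to MathML uses the same XSLT approach as in "Reading equations & formula from Word (Docx) to html and save database using java". That answer also describes where OMML2MML.XSL comes from.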
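When many formulas are converted, one option is to compile the stylesheet once and reuse it, using the JDK's javax.xml.transform.Templates. The sketch below shows that pattern with a toy stylesheet standing in for OMML2MML.XSL, so that it runs without any external file. The class name CompiledStylesheetSketch, the TOY_XSL constant and the formula element are assumptions made for illustration only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration of reusing a compiled stylesheet.
final class CompiledStylesheetSketch {

    // Toy stylesheet (NOT OMML2MML.XSL): wraps the text of <formula> in <math><mi>.
    static final String TOY_XSL =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:template match=\"/formula\">"
      + "<math><mi><xsl:value-of select=\".\"/></mi></math>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Compiled form of the stylesheet, built once in the constructor.
    private final Templates templates;

    CompiledStylesheetSketch(String xsl) {
        try {
            templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xsl)));
        } catch (Exception e) {
            throw new IllegalStateException("could not compile stylesheet", e);
        }
    }

    // Creates a fresh Transformer per call from the compiled Templates.
    String transform(String xml) {
        try {
            Transformer transformer = templates.newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        CompiledStylesheetSketch sketch = new CompiledStylesheetSketch(TOY_XSL);
        System.out.println(sketch.transform("<formula>x</formula>"));
        System.out.println(sketch.transform("<formula>y</formula>"));
    }
}
```

In the real program the StreamSource would point at OMML2MML.XSL and the input would be a DOMSource over the oMath node, as in getMathML further down; only the compile-once step differs.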
The file Formula.docx looks like this:
Code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.apache.xmlbeans.XmlCursor;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
/*
needs the full schema jar, as mentioned in https://poi.apache.org/faq.html#faq-N10025:
ooxml-schemas-1.4.jar, or poi-ooxml-full-5.0.0.jar when using Apache POI 5.0.0
*/
public class WordReadTextWithFormulasAsHTML {
 static File stylesheet = new File("OMML2MML.XSL");
 static TransformerFactory tFactory = TransformerFactory.newInstance();
 static StreamSource stylesource = new StreamSource(stylesheet);
 //method for getting MathML from oMath
 static String getMathML(CTOMath ctomath) throws Exception {
  Transformer transformer = tFactory.newTransformer(stylesource);
  Node node = ctomath.getDomNode();
  DOMSource source = new DOMSource(node);
  StringWriter stringwriter = new StringWriter();
  StreamResult result = new StreamResult(stringwriter);
  transformer.setOutputProperty("omit-xml-declaration", "yes");
  transformer.transform(source, result);
  String mathML = stringwriter.toString();
  stringwriter.close();
  //The stock OMML2MML.XSL transforms OMML into MathML as XML with namespace prefixes.
  //We don't need those since we want to use the MathML in HTML, not in XML.
  //Ideally we would change OMML2MML.XSL so that it does not emit them.
  //To keep this example as simple as possible, we strip them with literal string replacements.
  mathML = mathML.replace("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
  mathML = mathML.replace("xmlns:mml", "xmlns");
  mathML = mathML.replace("mml:", "");
  return mathML;
 }
 //method for getting HTML including MathML from XWPFParagraph
 static String getTextAndFormulas(XWPFParagraph paragraph) throws Exception {
  
  StringBuilder textWithFormulas = new StringBuilder();
  //using a cursor to walk the paragraph from start to end, in document order
  XmlCursor xmlcursor = paragraph.getCTP().newCursor();
  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (xmlcursor.getName().getPrefix().equalsIgnoreCase("w") && xmlcursor.getName().getLocalPart().equalsIgnoreCase("r")) {
     //elements w:r are text runs within the paragraph
     //simply append the text data
     textWithFormulas.append(xmlcursor.getTextValue());
    } else if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("oMath")) {
     //we have oMath
     //append the oMath as MathML
     textWithFormulas.append(getMathML((CTOMath)xmlcursor.getObject()));
    } 
   } else if (tokentype.isEnd()) {
    //we have to check whether we are at the end of the paragraph
    xmlcursor.push();
    xmlcursor.toParent();
    if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("p")) {
     break;
    }
    xmlcursor.pop();
   }
  }
  
  return textWithFormulas.toString();
 }
 public static void main(String[] args) throws Exception {
  XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));
  //using a StringBuilder for appending all the content as HTML
  StringBuilder allHTML = new StringBuilder();
  //loop over all IBodyElements (paragraphs and tables) in document order
  for (IBodyElement ibodyelement : document.getBodyElements()) {
   if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
    XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
    allHTML.append("<p>");
    allHTML.append(getTextAndFormulas(paragraph));
    allHTML.append("</p>");
   } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
    XWPFTable table = (XWPFTable)ibodyelement;
    allHTML.append("<table border=1>");
    for (XWPFTableRow row : table.getRows()) {
     allHTML.append("<tr>");
     for (XWPFTableCell cell : row.getTableCells()) {
      allHTML.append("<td>");
      for (XWPFParagraph paragraph : cell.getParagraphs()) {
       allHTML.append("<p>");
       allHTML.append(getTextAndFormulas(paragraph));
       allHTML.append("</p>");
      }
      allHTML.append("</td>");
     }
     allHTML.append("</tr>");
    }
    allHTML.append("</table>");
   }
  }
  document.close();
  //creating a sample HTML file 
  String encoding = "UTF-8";
  FileOutputStream fos = new FileOutputStream("result.html");
  OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
  writer.write("<!DOCTYPE html>\n");
  writer.write("<html lang=\"en\">");
  writer.write("<head>");
  writer.write("<meta charset=\"utf-8\"/>");
  //using MathJax for helping all browsers to interpret MathML
  writer.write("<script type=\"text/javascript\"");
  writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
  writer.write(">");
  writer.write("</script>");
  writer.write("</head>");
  writer.write("<body>");
  writer.write(allHTML.toString());
  writer.write("</body>");
  writer.write("</html>");
  writer.close();
  Desktop.getDesktop().browse(new File("result.html").toURI());
 }
}
Result:
Just tested this code using apache poi 5.0.0 and it works. You need poi-ooxml-full-5.0.0.jar for apache poi 5.0.0. Please read https://poi.apache.org/help/faq.html#faq-N10025 to see which ooxml libraries are needed for which apache poi version.
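One possible caveat about the code above: the text of each run is appended to the HTML exactly as getTextValue() returns it, so a run containing characters such as < or & could produce malformed HTML. A minimal sketch of an escaping helper that could be applied to run text before appending it follows. The class name HtmlEscapeSketch and the method escapeHtml are hypothetical names for illustration, and the helper would apply to text runs only, not to the MathML markup returned by getMathML.

```java
// Hypothetical helper; not part of Apache POI or of the answer's code.
final class HtmlEscapeSketch {

    // Replaces the characters that have special meaning in HTML text and attributes.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: a &lt; b &amp; c
        System.out.println(escapeHtml("a < b & c"));
    }
}
```

Under that assumption, the text-run branch of getTextAndFormulas would append escapeHtml(xmlcursor.getTextValue()) instead of the raw value.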