IndexOutOfBoundsException when using a Transformer on an empty CDATA section

hal*_*oei 6 java xml stax

I want to extract specific nodes from a large XML file. This works well until a CDATA section with no content turns up.

Output:

ERROR:  ''
javax.xml.transform.TransformerException: java.lang.IndexOutOfBoundsException
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:732)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:336)
    at xml_test.XML_Test.extractXML2(XML_Test.java:698)
    at xml_test.XML_Test.main(XML_Test.java:811)
Caused by: java.lang.IndexOutOfBoundsException
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getTextCharacters(XMLStreamReaderImpl.java:1143)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.handleCharacters(StAXStream2SAX.java:261)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.bridge(StAXStream2SAX.java:171)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.parse(StAXStream2SAX.java:120)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:674)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:723)
    ... 3 more
---------
java.lang.IndexOutOfBoundsException
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getTextCharacters(XMLStreamReaderImpl.java:1143)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.handleCharacters(StAXStream2SAX.java:261)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.bridge(StAXStream2SAX.java:171)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.parse(StAXStream2SAX.java:120)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:674)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:723)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:336)
    at xml_test.XML_Test.extractXML2(XML_Test.java:698)
    at xml_test.XML_Test.main(XML_Test.java:811)

Code:

InputStream stream = new FileInputStream("C:\\myFile.xml");
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader reader = factory.createXMLStreamReader(stream);

TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();

String extractPath = "/root";
String path = "";

while(reader.hasNext()) {
    reader.next();

    if(reader.isStartElement()) {
        path += "/" + reader.getLocalName();

        if(path.equals(extractPath)) {
            StringWriter writer = new StringWriter();
            StAXSource src = new StAXSource(reader);
            StreamResult res = new StreamResult(writer);
            t.transform(src, res); // Exception thrown

            System.out.println(writer.toString());

            path = path.substring(0, path.lastIndexOf("/"));
        }
    }
    else if(reader.isEndElement()) {
        path = path.substring(0, path.lastIndexOf("/"));
    }
}

XML that triggers the error:

<foo><![CDATA[]]></foo>

Can I make the Transformer simply ignore it? Or what would an alternative implementation look like? I cannot change the input XML!

Car*_*des 4

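This is a problem in the Xerces implementation; see https://issues.apache.org/jira/browse/XERCESJ-1033

It seems the implementation does not expect an empty CDATA section, so the only suggestions I can give you are:

  1. Change the XML parser implementation
  2. Remove the empty CDATA from the source file (replace "<![CDATA[]]>" with ""),
    or put a space inside the CDATA, e.g. <![CDATA[ ]]>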
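If the file on disk must stay untouched, the second suggestion could in principle be applied on the fly, by filtering the text before it reaches the parser. The following is only a minimal sketch of that idea: the class name `EmptyCdataStripper` and the method `strip` are made up for illustration. It assumes the file contains line breaks (it buffers one line at a time) and that the literal `<![CDATA[]]>` never appears inside a comment or another CDATA section.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class EmptyCdataStripper {

    private static final String EMPTY_CDATA = "<![CDATA[]]>";

    // Hypothetical helper: copies the input line by line and drops every
    // empty CDATA section. The token contains no line break, so it can
    // never be split across two lines.
    static void strip(Reader in, Writer out) throws IOException {
        BufferedReader lines = new BufferedReader(in);
        String line;
        while ((line = lines.readLine()) != null) {
            out.write(line.replace(EMPTY_CDATA, ""));
            out.write('\n'); // line endings are normalized to \n
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter cleaned = new StringWriter();
        strip(new StringReader("<root>\n  <foo><![CDATA[]]></foo>\n</root>"), cleaned);
        System.out.print(cleaned);
    }
}
```

For a large file, the cleaned text would typically go to a temporary file (or through a pipe) that is then handed to `createXMLStreamReader`, rather than into a `StringWriter` as in this small demo.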

I have added an example that uses another implementation.

JAXB

With JAXB you can map XML to POJOs in a simple way.

For example, suppose C:\myFile.xml contains the following XML:

<root>
  <foo><![CDATA[]]></foo>
  <foo><![CDATA[some data here]]></foo>
</root>

You can have the following POJOs:

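// Each public class goes in its own file. The javax.xml.bind package ships
// with Java 8 and earlier; on Java 11+ it is a separate dependency.
import java.util.List;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;
import javax.xml.bind.annotation.XmlValue;

@XmlRootElement
public class Root {

  // Collects every <foo> child of <root>
  @XmlElement(name="foo")
  private List<Foo> foo;

  public List<Foo> getFooList() {
    return foo;
  }

  public void setFooList(List<Foo> fooList) {
    this.foo = fooList;
  }

}

@XmlType(name = "foo")
public class Foo {

  // Text content of the element, including CDATA content
  @XmlValue
  private String content;

  @Override
  public String toString() {
    return content;
  }

}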

Then use the following snippet to unmarshal the XML into objects:

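// Needs: java.io.File, javax.xml.bind.JAXBContext,
// javax.xml.bind.JAXBException, javax.xml.bind.Unmarshaller
public static void main(String[] args) {
    try {
        File file = new File("C:\\myFile.xml");
        JAXBContext jaxbContext = JAXBContext.newInstance(Root.class);

        // Unmarshal the whole document into the Root POJO
        Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
        Root root = (Root) jaxbUnmarshaller.unmarshal(file);

        // The bars make empty content visible in the output
        for (Foo foo : root.getFooList()) {
            System.out.println(String.format("Foo content: |%s|", foo));
        }
    } catch (JAXBException e) {
        e.printStackTrace();
    }
}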

I tested this and it did not raise any errors.
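JAXB loads the whole document into objects, which may not suit a very large file. Another route worth trying stays with StAX but copies the wanted subtree event by event with `XMLEventReader` and `XMLEventWriter`. That avoids the `StAXSource` bridge (`StAXStream2SAX.handleCharacters` calling `getTextCharacters`) that appears in the stack trace. This is a sketch only: `SubtreeCopy` and `extract` are hypothetical names, it matches on the element's local name rather than a full path, and whether it sidesteps the failure on a particular JDK needs to be checked there.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class SubtreeCopy {

    // Hypothetical helper: serializes every subtree whose root element has
    // the given local name, copying StAX events instead of using a Transformer.
    static String extract(Reader xml, String localName) throws XMLStreamException {
        XMLEventReader events = XMLInputFactory.newInstance().createXMLEventReader(xml);
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        int depth = 0; // 0 means "outside any wanted subtree"
        while (events.hasNext()) {
            XMLEvent event = events.nextEvent();
            boolean startsSubtree = event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(localName);
            if (event.isStartElement() && (depth > 0 || startsSubtree)) {
                depth++;
            }
            if (depth > 0) {
                writer.add(event); // copy the event, including the closing tag
            }
            if (event.isEndElement() && depth > 0) {
                depth--;
            }
        }
        writer.flush();
        writer.close();
        events.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><foo><![CDATA[]]></foo>"
                + "<foo><![CDATA[some data here]]></foo></root>";
        System.out.println(extract(new StringReader(xml), "root"));
    }
}
```

For the file from the question, the `StringReader` would be replaced by a reader or stream over `C:\myFile.xml`, and the output by a writer of your choice.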