Why does SAXParser read so much before firing events?

Mar*_*cel 5 java xml sax stream saxparser

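Scenario: I receive a huge XML file over an extremely slow network, so I want to start processing it as early as possible. That is why I decided to use SAXParser.

I expected to get an event as soon as a tag is complete.

The following test shows what I mean: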

@Test
public void sax_parser_read_much_things_before_returning_events() throws Exception{
    String xml = "<a>"
               + "  <b>..</b>"
               + "  <c>..</c>"
                  // much more ...
               + "</a>";

    // wrapper to show what is read
    InputStream is = new InputStream() {
        InputStream is = new ByteArrayInputStream(xml.getBytes());

        @Override
        public int read() throws IOException {
            int val = is.read();
            System.out.print((char) val);
            return val;
        }
    };

    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(is, new DefaultHandler(){
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            System.out.print("\nHandler start: " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            System.out.print("\nHandler end: " + qName);
        }
    });
}

I wrapped the input stream to see what is read and when the events occur.

My expectation was something like this:

<a>                    <- output from read()
Handler start: a
<b>                    <- output from read()
Handler start: b
</b>                   <- output from read()
Handler end: b
...

Sadly, the result was as follows:

<a>  <b>..</b>  <c>..</c></a>        <- output from read()
Handler start: a
Handler start: b
Handler end: b
Handler start: c
Handler end: c
Handler end: a
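
Where is my mistake, and how can I get the expected result?

Edit:

  • First, the parser tries to detect the document version, which causes it to scan everything. When the document version is given, it stops somewhere in between (though not where I expected).
  • It is not acceptable for the parser to "want" to read, say, 1000 bytes and block until it has them, because the stream may not contain that much data at that point in time.
  • I found the buffer sizes in XMLEntityManager: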
    • public static final int DEFAULT_BUFFER_SIZE = 8192;
    • public static final int DEFAULT_XMLDECL_BUFFER_SIZE = 64;
    • public static final int DEFAULT_INTERNAL_BUFFER_SIZE = 1024;

Hol*_*ger 2

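You seem to be making wrong assumptions about how I/O works. Like most software, an XML parser requests data in chunks, because requesting single bytes from a stream is a recipe for a performance disaster.

This does not imply that the buffer has to be filled completely before a read attempt returns. It is just that a ByteArrayInputStream is incapable of emulating the behavior of a network InputStream. You can easily fix this by overriding read(byte[], int, int) so that it does not return a completely filled buffer but, for example, a single byte per request: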

@Test
public void sax_parser_read_much_things_before_returning_events() throws Exception{
    final String xml = "<a>"
               + "  <b>..</b>"
               + "  <c>..</c>"
                  // much more ...
               + "</a>";

    // wrapper to show what is read
    InputStream is = new InputStream() {
        InputStream is = new ByteArrayInputStream(xml.getBytes());

        @Override
        public int read() throws IOException {
            int val = is.read();
            System.out.print((char) val);
            return val;
        }
        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, 1);
        }
    };

    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(is, new DefaultHandler(){
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            System.out.print("\nHandler start: " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            System.out.print("\nHandler end: " + qName);
        }
    });
}

Running this prints the bytes read interleaved with the handler events, which shows how the XML parser adapts to the availability of data from the InputStream.
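
The short-read idea can also be packaged as a reusable wrapper. What follows is only a minimal sketch: TrickleInputStream and its maxChunk parameter are hypothetical names introduced here for illustration, not part of the JDK or of the test above. It caps every bulk read at a configurable number of bytes, roughly the way a slow socket might hand over data, and uses Math.min(len, maxChunk) rather than a fixed 1 so that a zero-length request stays valid.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper (an assumption for illustration): hands out at most
 * maxChunk bytes per bulk read, similar to a slow network stream.
 */
class TrickleInputStream extends FilterInputStream {
    private final int maxChunk;

    TrickleInputStream(InputStream in, int maxChunk) {
        super(in);
        if (maxChunk < 1) {
            throw new IllegalArgumentException("maxChunk must be >= 1");
        }
        this.maxChunk = maxChunk;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        // A short read is legal: for len > 0 the InputStream contract only
        // requires at least one byte, or -1 at end of stream.
        return super.read(b, off, Math.min(len, maxChunk));
    }

    // Small demo: a consumer asks for 8192 bytes but gets at most 3 per call.
    public static void main(String[] args) throws IOException {
        byte[] xml = "<a><b>..</b></a>".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new TrickleInputStream(new ByteArrayInputStream(xml), 3)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                System.out.println(n + " byte(s): "
                        + new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}
```

Passing new TrickleInputStream(stream, 1) to parser.parse(...) should behave like the read(byte[], int, int) override in the test above; larger values of maxChunk trade event latency for fewer read calls.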


  • read(byte[], int, int) can be simplified to return super.read(b, off, 1);. (2 upvotes)