DOM vs SAX XML parsing for large files

Shr*_*pta 3 javascript java xml parsing dom

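Background:

I have a large OWL (Web Ontology Language) file (approximately 125 MB, or 1.5 million lines) that I would like to parse into a set of tab-delimited values. I have been researching SAX and DOM XML parsers and found the following:

  • SAX allows a document to be read node by node, so the whole document is never in memory.
  • DOM allows the whole document to be held in memory at once, but has a ridiculous amount of overhead.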

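SAX vs DOM for large files:

As far as I understand,

  • If I use SAX, I will have to iterate through 1.5 million lines of XML, node by node.
  • If I use DOM, I will have a large overhead, but the results will then be returned quickly.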
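
For illustration, the SAX approach described above might look like the following minimal Java sketch. It assumes the file is OWL serialized as RDF/XML and that the wanted columns are each owl:Class IRI (its rdf:about attribute) and its rdfs:label; the class name SaxLabelHandler and that choice of columns are hypothetical, not taken from the question.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects "class IRI <TAB> label" rows while streaming.
class SaxLabelHandler extends DefaultHandler {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    final List<String> rows = new ArrayList<>();
    private String currentClass;  // rdf:about of the enclosing Class element, if any
    private StringBuilder text;   // non-null only while inside a label element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("Class".equals(localName)) {
            currentClass = atts.getValue(RDF_NS, "about");
        } else if ("label".equals(localName) && currentClass != null) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so accumulate them.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("label".equals(localName) && text != null) {
            rows.add(currentClass + "\t" + text);
            text = null;
        } else if ("Class".equals(localName)) {
            currentClass = null;
        }
    }

    static List<String> parse(InputSource source)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);  // needed so localName and RDF_NS lookups work
        SaxLabelHandler handler = new SaxLabelHandler();
        factory.newSAXParser().parse(source, handler);
        return handler.rows;
    }
}
```

Only the current element's state is kept between callbacks, which is why memory use stays flat regardless of file size. In a real run the rows would probably be written straight to an output file rather than collected in a list.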

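Problem:

I need to be able to use this parser multiple times on similar files of the same length.

Therefore, which parser should I use?

Bonus points: Does anyone know of any good parsers for JavaScript? I realize many are made for Java, but I am more comfortable with JavaScript.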

Rav*_*yal 5

Meet StAX

Just like SAX, StAX follows a streaming programming model for parsing XML. But it is a cross between the two: it combines DOM's read/write support and ease of use with SAX's CPU and memory efficiency.

SAX is read-only and does push parsing, which forces you to handle events and errors right then and there while the input is being parsed. StAX, on the other hand, is a pull parser: the client calls methods on the parser when it needs the next piece of data. This also means that an application can read multiple XML files simultaneously.
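
The pull style can be sketched with the StAX cursor API (javax.xml.stream) as below. This is a minimal example under assumptions: the input is OWL in RDF/XML form, and the wanted rows are each owl:Class IRI with its rdfs:label. The class name StaxToTsv and the choice of columns are hypothetical.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls events from the parser and builds TSV rows.
class StaxToTsv {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    static List<String> toTsv(Reader in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> rows = new ArrayList<>();
        String currentClass = null;  // rdf:about of the enclosing Class element, if any
        try {
            // The application drives the loop: it asks for the next event when ready.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("Class".equals(name)) {
                        currentClass = reader.getAttributeValue(RDF_NS, "about");
                    } else if ("label".equals(name) && currentClass != null) {
                        // getElementText() reads up to the matching end tag.
                        rows.add(currentClass + "\t" + reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "Class".equals(reader.getLocalName())) {
                    currentClass = null;
                }
            }
        } finally {
            reader.close();
        }
        return rows;
    }
}
```

Because the loop belongs to the caller, state lives in ordinary local variables instead of handler fields, and two readers over two files can be advanced in step from the same loop.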

JAXP API comparison

+-------------------------------------+------------------------+------------------------+----------------------+---------------------------+
| JAXP API Property                   | StAX                   | SAX                    | DOM                  | TrAX                      |
+-------------------------------------+------------------------+------------------------+----------------------+---------------------------+
| API Style                           | Pull events; streaming | Push events; streaming | In-memory tree based | XSLT rule-based templates |
| Ease of Use                         | High                   | Medium                 | High                 | Medium                    |
| XPath Capability                    | No                     | No                     | Yes                  | Yes                       |
| CPU and Memory Utilization          | Good                   | Good                   | Depends              | Depends                   |
| Forward Only                        | Yes                    | Yes                    | No                   | No                        |
| Reading                             | Yes                    | Yes                    | Yes                  | Yes                       |
| Writing                             | Yes                    | No                     | Yes                  | Yes                       |
| Create, Read, Update, Delete (CRUD) | No                     | No                     | Yes                  | No                        |
+-------------------------------------+------------------------+------------------------+----------------------+---------------------------+

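Reference:
Does StAX Belong in Your XML Toolbox?

StAX is a "pull" API. As noted above, it offers both a Cursor API and an Event Iterator API, and the API has both a reading side and a writing side. It is more developer-friendly than SAX. Like SAX, StAX does not require the entire document to be held in memory. Unlike SAX, however, the entire document need not be read: portions can be skipped. This may result in even better performance than SAX.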
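
A rough sketch of skipping and stopping early with the cursor API follows. It assumes an RDF/XML file whose owl:Ontology header block is of no interest and where only the first rdfs:label after it is wanted; the class StaxEarlyExit and both method names are hypothetical. Note that the parser still has to scan the skipped bytes to find the matching end tag. The saving is that the application does no work for them and stops reading once it has its answer.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: skips one subtree, then returns as soon as a label is found.
class StaxEarlyExit {
    static String firstLabelAfterHeader(Reader in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if ("Ontology".equals(name)) {
                    skipElement(reader);  // ignore everything inside the header
                } else if ("label".equals(name)) {
                    return reader.getElementText();  // stop here; the rest is never read
                }
            }
            return null;  // no label found
        } finally {
            reader.close();
        }
    }

    // Advances past the element the reader is currently positioned on,
    // tracking nesting depth so that child elements are skipped as well.
    static void skipElement(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }
}
```

With SAX the usual way to stop early is to throw an exception out of the handler, so the early return above is one place where the pull model is noticeably simpler.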