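I have seen many posts/blogs/articles about splitting an XML file into smaller chunks, and decided to write my own because I have some custom requirements. Here is what I mean; consider the following XML: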
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<company>
<staff id="1">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="2">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="3">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="4">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="5">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<salary>100000</salary>
</staff>
</company>
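I want to split this XML into n parts, each going into its own file, but a staff element must contain a nickname; if it does not, I don't want it. So this should produce 4 XML splits, containing the staff elements with ids 1 through 4.
Here is my code: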
public int split() throws Exception{
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputFilePath)));
    String line;
    List<String> tempList = null;
    while((line=br.readLine())!=null){
        if(line.contains("<?xml version=\"1.0\"") || line.contains("<" + rootElement + ">") || line.contains("</" + rootElement + ">")){
            continue;
        }
        if(line.contains("<"+ element +">")){
            tempList = new ArrayList<String>();
        }
        tempList.add(line);
        if(line.contains("</"+ element +">")){
            if(hasConditions(tempList)){
                writeToSplitFile(tempList);
                writtenObjectCounter++;
                totalCounter++;
            }
        }
        if(writtenObjectCounter == itemsPerFile){
            writtenObjectCounter = 0;
            fileCounter++;
            tempList.clear();
        }
    }
    if(tempList.size() != 0){
        writeClosingRootElement();
    }
    return totalCounter;
}

private void writeToSplitFile(List<String> itemList) throws Exception{
    BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
    if(writtenObjectCounter == 0){
        wr.write("<" + rootElement + ">");
        wr.write("\n");
    }
    for (String string : itemList) {
        wr.write(string);
        wr.write("\n");
    }
    if(writtenObjectCounter == itemsPerFile-1)
        wr.write("</" + rootElement + ">");
    wr.close();
}

private void writeClosingRootElement() throws Exception{
    BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
    wr.write("</" + rootElement + ">");
    wr.close();
}

private boolean hasConditions(List<String> list){
    int matchList = 0;
    for (String condition : conditionList) {
        for (String string : list) {
            if(string.contains(condition)){
                matchList++;
            }
        }
    }
    if(matchList >= conditionList.size()){
        return true;
    }
    return false;
}
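I know that opening and closing the stream for every staff element written does hurt performance. The alternative would be to write once per file (which may contain n staff elements). Naturally, the root and split elements are configurable.
Any ideas on how to improve the performance/logic? I'd prefer some code, but good advice can sometimes be better.
Edit:
This XML example is actually a dummy example; the real XML I'm trying to split has about 300-500 different elements, which appear in random order and vary in number. Maybe StAX isn't the best solution after all?
Bounty update:
I'm looking for a solution (code) that will:
Be able to split an XML file into n parts with x split elements (in the dummy XML example, staff is the split element).
The content of the split files should be wrapped in the root element of the original file (company in the dummy example).
I'd like to be able to specify a condition that must hold within the split element, i.e. I want only staff that have a nickname and want to discard those without one. But it should also be possible to split unconditionally when no condition is given.
The code doesn't necessarily have to improve on my solution (which lacks good logic and performance), but it works.
And I'm not happy with "but it works". I also can't find enough examples of StAX being used for this kind of operation, and the user community isn't great either. It doesn't have to be a StAX solution either.
I may be asking too much, but I'm here to learn, and I'm offering a good bounty for what I consider a solution.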
Jon*_*eet 20
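First piece of advice: don't try to write your own XML-handling code. Use an XML parser; it will be a lot more reliable and quite possibly faster.
If you use an XML pull parser (e.g. StAX), you should be able to read one element at a time and write it out to disk, never reading the whole document in one go.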
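To illustrate the reliability point, here is a minimal sketch (not code from the question) that compares a string check in the style of `line.contains("<" + element + ">")` with a StAX cursor. As far as I can tell, the string check never matches a start tag that carries attributes, such as `<staff id="1">`. The class name `ParserVsStringMatch`, its method names, and the shortened sample XML are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ParserVsStringMatch {
    // Hypothetical sample, shortened from the XML in the question.
    static final String XML =
        "<company>"
      + "<staff id=\"1\"><nickname>mkyong</nickname></staff>"
      + "<staff id=\"2\"></staff>"
      + "</company>";

    // String check in the style of line.contains("<" + element + ">").
    static boolean naiveContainsStart(String text, String element) {
        return text.contains("<" + element + ">");
    }

    // Counts start tags with the given local name using a StAX cursor.
    static int countElements(String xml, String element) {
        try {
            XMLStreamReader xsr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT
                        && xsr.getLocalName().equals(element)) {
                    count++;
                }
            }
            xsr.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(naiveContainsStart(XML, "staff")); // false: the tags carry attributes
        System.out.println(countElements(XML, "staff"));      // 2
    }
}
```

The parser also copes with several elements on one line, or one tag spread over several lines, both of which a line-based reader would mishandle.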
Mic*_*Kay 10
Here is my suggestion. It requires a streaming XSLT 3.0 processor, which in practice means Saxon-EE 9.3.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:mode streamable="yes"/>
  <xsl:template match="/">
    <xsl:apply-templates select="company/staff"/>
  </xsl:template>
  <xsl:template match="staff">
    <xsl:variable name="v" as="element(staff)">
      <xsl:copy-of select="."/>
    </xsl:variable>
    <xsl:if test="$v/nickname">
      <xsl:result-document href="{@id}.xml">
        <xsl:copy-of select="$v"/>
      </xsl:result-document>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
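But in practice, unless you have hundreds of megabytes of data, I suspect a non-streaming solution will be quite fast enough, and probably faster than your hand-written Java code, given that your Java code is nothing to get excited about. In any case, give an XSLT solution a try before you write reams of low-level Java. It's a routine problem, after all.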
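For comparison, here is a rough sketch of what a non-streaming approach could look like in plain Java with the JDK's DOM and JAXP serializer. It is an illustration, not a translation of the stylesheet above: it loads the whole document into memory, returns each split as a string instead of writing `{@id}.xml` files, and treats any descendant with the required name as satisfying the condition. The class name `DomSplitSketch` and the method `split` are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomSplitSketch {
    /**
     * Returns one serialized document per matching split element, each wrapped
     * in a copy of the original root element. requiredChild may be null, in
     * which case every split element is kept.
     */
    static List<String> split(String xml, String splitElement, String requiredChild) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document source = db.parse(new InputSource(new StringReader(xml)));
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            List<String> parts = new ArrayList<>();
            NodeList children = source.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                // Skip whitespace text nodes and elements that are not split elements.
                if (node.getNodeType() != Node.ELEMENT_NODE
                        || !node.getNodeName().equals(splitElement)) {
                    continue;
                }
                Element element = (Element) node;
                if (requiredChild != null
                        && element.getElementsByTagName(requiredChild).getLength() == 0) {
                    continue; // condition not met: discard this element
                }
                // Build a small document: shallow copy of the root plus this element.
                Document part = db.newDocument();
                Element root = (Element) part.importNode(source.getDocumentElement(), false);
                part.appendChild(root);
                root.appendChild(part.importNode(element, true));

                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(part), new StreamResult(out));
                parts.add(out.toString());
            }
            return parts;
        } catch (Exception e) {
            throw new IllegalStateException("split failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<company><staff id=\"1\"><nickname>a</nickname></staff>"
                   + "<staff id=\"2\"/></company>";
        split(xml, "staff", "nickname").forEach(System.out::println);
    }
}
```

Whether this is fast enough depends on the input size; for inputs that do not fit comfortably in memory, a streaming approach is the safer choice.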
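You can do the following with StAX:
Algorithm
1. Read the root element's start event and hold on to it.
2. Read the next split element, caching its events until the condition is met (only the start event is cached when no condition is given).
3. If the condition is met, write a new document: the start document event, the root start element, the cached events, the remaining events of the split element, then the closing root element and the end document event.
4. If the condition is not met, discard the element.
5. Repeat from step 2 with the next split element.
Code for Use Case
The following code uses the StAX API to break up the document as outlined in your question: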
package forum7408938;

import java.io.*;
import java.util.*;
import javax.xml.namespace.QName;
import javax.xml.stream.*;
import javax.xml.stream.events.*;

public class Demo {

    public static void main(String[] args) throws Exception {
        Demo demo = new Demo();
        demo.split("src/forum7408938/input.xml", "nickname");
        //demo.split("src/forum7408938/input.xml", null);
    }

    private void split(String xmlResource, String condition) throws Exception {
        XMLEventFactory xef = XMLEventFactory.newFactory();
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLEventReader xer = xif.createXMLEventReader(new FileReader(xmlResource));
        StartElement rootStartElement = xer.nextTag().asStartElement(); // Advance to the root element
        StartDocument startDocument = xef.createStartDocument();
        EndDocument endDocument = xef.createEndDocument();

        XMLOutputFactory xof = XMLOutputFactory.newFactory();
        while(xer.hasNext() && !xer.peek().isEndDocument()) {
            boolean metCondition;
            XMLEvent xmlEvent = xer.nextTag();
            if(!xmlEvent.isStartElement()) {
                break;
            }
            // BOUNTY CRITERIA
            // Be able to split XML file into n parts with x split elements(from
            // the dummy XML example staff is the split element).
            StartElement breakStartElement = xmlEvent.asStartElement();
            List<XMLEvent> cachedXMLEvents = new ArrayList<XMLEvent>();

            // BOUNTY CRITERIA
            // I'd like to be able to specify condition that must be in the
            // split element i.e. I want only staff which have nickname, I want
            // to discard those without nicknames. But be able to also split
            // without conditions while running split without conditions.
            if(null == condition) {
                cachedXMLEvents.add(breakStartElement);
                metCondition = true;
            } else {
                cachedXMLEvents.add(breakStartElement);
                xmlEvent = xer.nextEvent();
                metCondition = false;
                while(!(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
                    cachedXMLEvents.add(xmlEvent);
                    if(xmlEvent.isStartElement() && xmlEvent.asStartElement().getName().getLocalPart().equals(condition)) {
                        metCondition = true;
                        break;
                    }
                    xmlEvent = xer.nextEvent();
                }
            }

            if(metCondition) {
                // Create a file for the fragment, the name is derived from the value of the id attribute
                FileWriter fileWriter = null;
                fileWriter = new FileWriter("src/forum7408938/" + breakStartElement.getAttributeByName(new QName("id")).getValue() + ".xml");

                // A StAX XMLEventWriter will be used to write the XML fragment
                XMLEventWriter xew = xof.createXMLEventWriter(fileWriter);
                xew.add(startDocument);

                // BOUNTY CRITERIA
                // The content of the split files should be wrapped in the
                // root element from the original file(like in the dummy example
                // company)
                xew.add(rootStartElement);

                // Write the XMLEvents that were cached while we were
                // checking the fragment to see if it matched our criteria.
                for(XMLEvent cachedEvent : cachedXMLEvents) {
                    xew.add(cachedEvent);
                }

                // Write the XMLEvents that we still need to parse from this
                // fragment
                xmlEvent = xer.nextEvent();
                while(xer.hasNext() && !(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
                    xew.add(xmlEvent);
                    xmlEvent = xer.nextEvent();
                }
                xew.add(xmlEvent);

                // Close everything we opened
                xew.add(xef.createEndElement(rootStartElement.getName(), null));
                xew.add(endDocument);
                // Close the event writer first so it flushes before the file is closed
                xew.close();
                fileWriter.close();
            }
        }
    }

}
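The question also asks for several split elements per output file, with each file opened only once. The following is a rough sketch of how the same StAX event approach might be adapted for that; it is a variation under stated assumptions, not part of the answer's code. The class name `BatchedStaxSplit`, the `split_<n>.xml` file naming, and the `itemsPerFile` parameter (borrowed from the question) are choices made for this example, and writers are not closed if an exception occurs mid-split.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class BatchedStaxSplit {
    /** Writes at most itemsPerFile split elements per file; returns the files written. */
    static List<Path> split(Reader in, Path outDir, String splitElement,
                            String requiredChild, int itemsPerFile) {
        List<Path> files = new ArrayList<>();
        XMLEventFactory xef = XMLEventFactory.newInstance();
        try {
            Files.createDirectories(outDir);
            XMLEventReader xer = XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLOutputFactory xof = XMLOutputFactory.newInstance();
            StartElement root = null;
            Writer out = null;
            XMLEventWriter xew = null;
            int written = 0;
            while (xer.hasNext()) {
                XMLEvent event = xer.nextEvent();
                if (!event.isStartElement()) continue;
                if (root == null) { root = event.asStartElement(); continue; }
                // Buffer one complete child of the root, tracking nesting depth.
                List<XMLEvent> buffered = new ArrayList<>();
                buffered.add(event);
                boolean isSplit = event.asStartElement().getName().getLocalPart().equals(splitElement);
                boolean conditionMet = requiredChild == null;
                for (int depth = 1; depth > 0; ) {
                    XMLEvent e = xer.nextEvent();
                    if (e.isStartElement()) {
                        depth++;
                        conditionMet |= e.asStartElement().getName().getLocalPart().equals(requiredChild);
                    } else if (e.isEndElement()) {
                        depth--;
                    }
                    buffered.add(e);
                }
                if (!isSplit || !conditionMet) continue; // discard this element
                if (xew == null) { // open the next output file once per batch
                    Path file = outDir.resolve("split_" + files.size() + ".xml");
                    out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
                    xew = xof.createXMLEventWriter(out);
                    xew.add(xef.createStartDocument());
                    xew.add(root);
                    files.add(file);
                }
                for (XMLEvent e : buffered) xew.add(e);
                if (++written == itemsPerFile) {
                    finish(xew, out, xef, root);
                    xew = null;
                    written = 0;
                }
            }
            if (xew != null) finish(xew, out, xef, root); // last, partially filled file
            xer.close();
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException("split failed", e);
        }
        return files;
    }

    // Closes the root element and the document, then releases the writers.
    private static void finish(XMLEventWriter xew, Writer out, XMLEventFactory xef,
                               StartElement root) throws IOException, XMLStreamException {
        xew.add(xef.createEndElement(root.getName(), null));
        xew.add(xef.createEndDocument());
        xew.flush();
        xew.close();
        out.close();
    }
}
```

A call such as `BatchedStaxSplit.split(new FileReader("input.xml"), Paths.get("out"), "staff", "nickname", 2)` would, for the dummy XML in the question, be expected to produce two files holding staff 1-2 and staff 3-4. Unlike the code above, each element is buffered in full before the decision to keep it, which is simpler but costs memory proportional to the largest split element.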