Pau*_*aul 7 java xml parsing sax tokenize
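I'm trying to find a way to determine the exact line number and character position of tags and attributes while parsing an XML document. I want this so that I can report precisely, through a web interface, to the author of an invalid XML document which part of the document is invalid.
Ultimately, I want to place a caret at the invalid tag, or just inside the opening quote of the invalid attribute. (I'm not using XML Schema at this point, because the exact format of the attributes matters in a way that a schema alone can't validate. I may even want to report that an attribute is invalid part-way through its value, or similarly, part-way through the text between an opening and a closing tag.)
I've tried SAX (org.xml.sax) and the Locator interface. This works up to a point, but it isn't good enough. It only reports the read position after an event has occurred; for startElement(), for example, that is the character immediately after the end of the opening tag. I can't simply subtract the length of the tag name, because attributes, self-closing tags and/or newlines inside the opening tag will throw the result off. (And Locator provides no information at all about the positions of attributes.)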
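To make the Locator behaviour described above concrete, here is a minimal sketch, assuming the SAX parser bundled with the JDK. The class name LocatorEndPositionDemo, the helper positions() and the sample document are made up for illustration. It records the line and column the Locator reports inside startElement() for an element whose opening tag spans two lines.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question.
class LocatorEndPositionDemo {

    // Returns one "qName@line:column" entry per start tag, using whatever
    // position the Locator reports at the time startElement() is called.
    static List<String> positions(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                out.add(qName + "@" + locator.getLineNumber()
                        + ":" + locator.getColumnNumber());
            }
        });
        return out;
    }

    public static void main(String[] args) throws Exception {
        // <item> starts on line 2, but its opening tag only ends on line 3.
        String xml = "<root>\n  <item a=\"1\"\n        b=\"2\"/>\n</root>";
        positions(xml).forEach(System.out::println);
    }
}
```

With the JDK parser, the entry for item carries line 3, where the tag ends, rather than line 2, where it begins. The exact column values are parser-specific, since the SAX documentation describes Locator positions as approximations intended for diagnostics.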
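Ideally, I'd like to use an event-based approach, because I already have a SAX handler that builds an in-house DOM-like representation or does further processing. However, I'd also be interested in any DOM or DOM-like library that keeps exact position information for the elements of its model.
Has anyone solved this problem, or anything like it, with the required precision?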
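I wrote a quick XML reader that gets line numbers, throws an exception when it encounters an unwanted attribute, and reports the text preceding the point where the error was thrown.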
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Stack;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.log4j.Logger;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class LocatorTestSAXReader {

    private static final Logger logger = Logger.getLogger(LocatorTestSAXReader.class);
    private static final String XML_FILE_PATH = "lib/xml/test-instance1.xml";

    public Document readXMLFile() {
        Document doc = null;
        SAXParser parser = null;
        SAXParserFactory saxFactory = SAXParserFactory.newInstance();
        try {
            parser = saxFactory.newSAXParser();
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            doc = docBuilder.newDocument();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        }

        // Accumulates the element text seen so far, for use in the error message.
        StringBuilder text = new StringBuilder();
        DefaultHandler eleHandler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void characters(char[] ch, int start, int length) {
                String thisText = new String(ch, start, length);
                // Keep only chunks that contain a letter; (?s) lets '.' match newlines too.
                if (thisText.matches("(?s).*[a-zA-Z].*")) {
                    text.append(thisText);
                    logger.debug("element text: " + thisText);
                }
            }

            @Override
            public void setDocumentLocator(Locator locator) {
                // The parser calls this before any other event.
                this.locator = locator;
            }

            @Override
            public void startElement(final String uri, final String localName, final String qName,
                    final Attributes attributes) throws SAXException {
                int lineNum = locator.getLineNumber();
                logger.debug("I am now on line " + lineNum + " at element " + qName);
                int len = attributes.getLength();
                for (int i = 0; i < len; i++) {
                    String attVal = attributes.getValue(i);
                    String attName = attributes.getQName(i);
                    logger.debug("att " + attName + "=" + attVal);
                    if (attName.startsWith("bad")) {
                        throw new SAXException("found attr : " + attName + "=" + attVal
                                + " that starts with bad! at line : " + locator.getLineNumber()
                                + " at element " + qName + "\nelement occurs below text : " + text);
                    }
                }
            }
        };

        try {
            parser.parse(new FileInputStream(new File(XML_FILE_PATH)), eleHandler);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return doc;
    }
}
As for the text: depending on where in the XML file the error occurs, there may not be any text. So with this XML:
<?xml version="1.0"?>
<root>
<section>
<para>This is a quick doc to test the ability to get line numbers via the Locator object. </para>
</section>
<section bad:attr="ok">
<para>another para.</para>
</section>
</root>
If the bad attribute were in the first element, the text would be blank. In this case, the exception thrown is:
org.xml.sax.SAXException: found attr : bad:attr=ok that starts with bad! at line : 6 at element section
element occurs below text : This is a quick doc to test the ability to get line numbers via the Locator object.
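As a possible variation on the reader above, the handler could throw a SAXParseException built from the Locator, so that the caller can read the line and column from the exception instead of parsing them out of the message. This is only a sketch under assumptions: the class name BadAttributeReporter, the helper findBadAttribute() and the message wording are hypothetical, and the position is still the end of the opening tag, not the attribute itself.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original answer.
class BadAttributeReporter {

    // Returns the exception describing the first attribute whose name starts
    // with "bad", or null if the document contains no such attribute.
    static SAXParseException findBadAttribute(String xml) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) throws SAXException {
                for (int i = 0; i < attributes.getLength(); i++) {
                    String name = attributes.getQName(i);
                    if (name.startsWith("bad")) {
                        // The two-argument constructor copies the current line
                        // and column out of the Locator into the exception.
                        throw new SAXParseException(
                                "attribute " + name + " is not allowed on " + qName, locator);
                    }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return null;
        } catch (SAXParseException e) {
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<root>\n<section>\n<para>text</para>\n"
                + "</section>\n<section bad:attr=\"ok\">\n<para>another para.</para>\n"
                + "</section>\n</root>";
        SAXParseException e = findBadAttribute(xml);
        System.out.println(e.getMessage() + " (line " + e.getLineNumber()
                + ", column " + e.getColumnNumber() + ")");
    }
}
```

Like the reader above, this relies on the default, non-namespace-aware SAXParserFactory, so the undeclared bad: prefix is passed through as part of the attribute's qualified name.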
When you say you've tried using the Locator object, what exactly was the problem?