yos*_*ssi 22 java regex xml invalid-characters
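Hello, I want to remove all invalid XML characters from a string. I would like to do it with a regular expression and the string.replace method,
like this: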
line.replace(regExp,"");
What is the right regExp to use?
An invalid XML character is anything that is not one of these:
[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Thanks.
McD*_*ell 76
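Java's regex engine supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars (a surrogate pair).
Here is the pattern for removing characters that are illegal in XML 1.0: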
// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
+ "\u0009\r\n"
+ "\u0020-\uD7FF"
+ "\uE000-\uFFFD"
+ "\ud800\udc00-\udbff\udfff"
+ "]";
Most people will want the XML 1.0 version.
Here is the pattern for removing characters that are illegal in XML 1.1:
// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
+ "\u0001-\uD7FF"
+ "\uE000-\uFFFD"
+ "\ud800\udc00-\udbff\udfff"
+ "]+";
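You will need to use String.replaceAll(...) instead of String.replace(...), because only replaceAll treats its first argument as a regular expression.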
String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, "");
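If the filter runs on many strings, the XML 1.0 pattern above can be compiled once and reused. The following is a minimal sketch under that assumption; the class name XmlCharFilter and the method name stripIllegalXml10 are made up for illustration and are not part of the answer.

```java
import java.util.regex.Pattern;

class XmlCharFilter {
    // Same character class as the XML 1.0 pattern above, compiled a single time.
    private static final Pattern XML10_ILLEGAL = Pattern.compile(
            "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]");

    // Hypothetical helper: removes every code point that XML 1.0 does not allow.
    public static String stripIllegalXml10(String s) {
        return XML10_ILLEGAL.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // NUL and U+0001 are illegal; tab and the supplementary character U+1F600 are legal.
        String dirty = "a\u0000b\u0001c\t\uD83D\uDE00";
        String clean = stripIllegalXml10(dirty);
        System.out.println(clean.equals("abc\t\uD83D\uDE00")); // prints "true"
    }
}
```

The supplementary character survives because the regex engine reads the surrogate pair as one code point, which falls inside the last range of the class.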
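Should we take surrogate characters into account? Otherwise '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true, because a single char cannot hold a value above 0xFFFF.
I also tested it, and the regex approach seems to be slower than the following loop.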
// The method name is hypothetical; the original snippet showed only the body.
static String stripInvalidXmlChars(String text) {
    if (null == text || text.isEmpty()) {
        return text;
    }
    final int len = text.length();
    char current = 0;
    int codePoint = 0;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < len; i++) {
        current = text.charAt(i);
        boolean surrogate = false;
        // A valid surrogate pair is decoded into one supplementary code point.
        if (Character.isHighSurrogate(current)
                && i + 1 < len && Character.isLowSurrogate(text.charAt(i + 1))) {
            surrogate = true;
            codePoint = text.codePointAt(i++);
        } else {
            codePoint = current;
        }
        // Keep only code points allowed by the XML 1.0 Char production.
        if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
                || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
                || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
                || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
            sb.append(current);
            if (surrogate) {
                // i already points at the low surrogate after the i++ above.
                sb.append(text.charAt(i));
            }
        }
    }
    return sb.toString();
}
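The point about surrogates can be checked directly. The sketch below (class and method names are illustrative only) shows that a single char never reaches 0x10000, while String.codePointAt combined with Character.charCount does see the supplementary value.

```java
class SurrogateDemo {
    // Counts code points in the supplementary range U+10000..U+10FFFF.
    public static int countSupplementary(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= 0x10000 && cp <= 0x10FFFF) {
                count++;
            }
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "x\uD83D\uDE00"; // 'x' followed by U+1F600
        char high = s.charAt(1);
        System.out.println(high >= 0x10000);                       // prints "false"
        System.out.println(Integer.toHexString(s.codePointAt(1))); // prints "1f600"
        System.out.println(countSupplementary(s));                 // prints "1"
    }
}
```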
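So far, all of these answers only replace the characters themselves. But sometimes an XML document contains invalid XML entity sequences, and those cause errors as well. For example, if your XML contains a character reference such as &#2;, a Java XML parser will throw Illegal character entity: expansion character (code 0x2 at ....
Here is a simple Java program that replaces those invalid entity sequences.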
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final Pattern XML_ENTITY_PATTERN = Pattern.compile("\\&\\#(?:x([0-9a-fA-F]+)|([0-9]+))\\;");

/**
 * Remove problematic xml entities from the xml string so that you can parse it with java DOM / SAX libraries.
 */
String getCleanedXml(String xmlString) {
    Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
    Set<String> replaceSet = new HashSet<>();
    while (m.find()) {
        String group = m.group(1);
        int val;
        if (group != null) {
            // Hexadecimal reference, e.g. &#x2;
            val = Integer.parseInt(group, 16);
            if (isInvalidXmlChar(val)) {
                replaceSet.add("&#x" + group + ";");
            }
        } else if ((group = m.group(2)) != null) {
            // Decimal reference, e.g. &#2;
            val = Integer.parseInt(group);
            if (isInvalidXmlChar(val)) {
                replaceSet.add("&#" + group + ";");
            }
        }
    }
    String cleanedXmlString = xmlString;
    for (String replacer : replaceSet) {
        // Literal replacement; the collected references are not regular expressions.
        cleanedXmlString = cleanedXmlString.replace(replacer, "");
    }
    return cleanedXmlString;
}

private boolean isInvalidXmlChar(int val) {
    // Valid XML 1.0 characters: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if (val == 0x9 || val == 0xA || val == 0xD ||
            (val >= 0x20 && val <= 0xD7FF) ||
            (val >= 0xE000 && val <= 0xFFFD) ||
            (val >= 0x10000 && val <= 0x10FFFF)) {
        return false;
    }
    return true;
}
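As an alternative to collecting the references in a set and replacing them afterwards, the same cleanup can be done in one pass with Matcher.appendReplacement. This is only a sketch: the names XmlEntityCleaner, dropInvalidCharRefs and isValidXml10 are hypothetical, and treating a numeric value that overflows an int as invalid is an assumption added here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlEntityCleaner {
    // Matches numeric character references such as &#2; or &#x1F;
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    // True when the code point is allowed by the XML 1.0 Char production.
    public static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes references to invalid characters and leaves everything else untouched.
    public static String dropInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean valid;
            try {
                int cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                valid = isValidXml10(cp);
            } catch (NumberFormatException tooLarge) {
                // Assumption: a value that does not fit in an int cannot be a code point.
                valid = false;
            }
            // Valid references are copied verbatim; invalid ones become the empty string.
            m.appendReplacement(out, valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // &#2; and &#x0; are dropped, &#x41; is kept.
        System.out.println(dropInvalidCharRefs("<a>x&#2;y&#x41;z&#x0;</a>")); // prints "<a>xy&#x41;z</a>"
    }
}
```

Scanning once avoids re-reading the whole document for every distinct invalid reference, which may matter for large inputs.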