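I found that cElementTree is about 30 times faster than xml.dom.minidom, so I'm rewriting my XML encoding/decoding code. However, I need to output XML that contains CDATA sections, and there doesn't seem to be a way to do that with ElementTree.
Can it be done?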
eli*_*ner 27
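After a bit of work, I found the answer myself. Looking at the ElementTree.py source code, I found that XML comments and processing instructions get special handling. For each special element type there is a factory function that uses a special (non-string) tag value to distinguish it from regular elements.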
```python
def Comment(text=None):
    element = Element(Comment)
    element.text = text
    return element
```
Then, in the `_write` function of ElementTree that actually outputs the XML, there is a special case that handles comments:
```python
if tag is Comment:
    file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
```
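To support CDATA sections, I created a factory function called `CDATA`, extended the `ElementTree` class, and changed the `_write` function to handle CDATA elements.
This still doesn't help if you want to parse XML that contains CDATA sections and then output it again with the CDATA sections intact, but it at least lets you create XML with CDATA sections programmatically, which is what I needed to do.
The implementation seems to work with both ElementTree and cElementTree.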
```python
# Targets Python 2 and relies on the private ElementTree._write method.
import elementtree.ElementTree as etree
#~ import cElementTree as etree

def CDATA(text=None):
    # The function object itself serves as the unique, non-string tag.
    element = etree.Element(CDATA)
    element.text = text
    return element

class ElementTreeCDATA(etree.ElementTree):
    def _write(self, file, node, encoding, namespaces):
        if node.tag is CDATA:
            # Emit the text verbatim inside a CDATA section, unescaped.
            text = node.text.encode(encoding)
            file.write("\n<![CDATA[%s]]>\n" % text)
        else:
            etree.ElementTree._write(self, file, node, encoding, namespaces)

if __name__ == "__main__":
    import sys

    text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""

    e = etree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    et = ElementTreeCDATA(e)
    et.write(sys.stdout, "utf-8")
```
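The round-trip limitation mentioned in this answer can be seen with a few lines. This is a minimal sketch, assuming Python 3's standard `xml.etree.ElementTree`; the sample document and variable names are made up for illustration. The parser folds a CDATA section into ordinary element text, so nothing marks it as CDATA when the tree is written back out.

```python
import xml.etree.ElementTree as ET

# Illustrative input: one element whose content is a CDATA section.
source = "<data><![CDATA[<b>bold</b> & more]]></data>"

root = ET.fromstring(source)

# The parser hands back plain text; the CDATA wrapper is gone.
print(repr(root.text))

# Serializing escapes the markup characters instead of restoring CDATA.
reserialized = ET.tostring(root, encoding="unicode")
print(reserialized)
```

The content survives the round trip unchanged; only its CDATA representation is lost, which is why the factory-function approach is needed when the output must contain literal CDATA sections.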
小智 10
Here is a variant of gooli's solution that works with Python 3.2:
```python
import xml.etree.ElementTree as etree

def CDATA(text=None):
    # A tag name that cannot occur in real XML marks the CDATA element.
    element = etree.Element('![CDATA[')
    element.text = text
    return element

etree._original_serialize_xml = etree._serialize_xml

def _serialize_xml(write, elem, qnames, namespaces):
    if elem.tag == '![CDATA[':
        write("\n<%s%s]]>\n" % (
                elem.tag, elem.text))
        return
    return etree._original_serialize_xml(
        write, elem, qnames, namespaces)

# Replace the private serializer so ElementTree.write() picks up the hook.
etree._serialize_xml = etree._serialize['xml'] = _serialize_xml

if __name__ == "__main__":
    import sys

    text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""

    e = etree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    et = etree.ElementTree(e)
    et.write(sys.stdout.buffer.raw, "utf-8")
```
I don't know whether the previously proposed code ever worked well, or whether the ElementTree module has been updated since, but I ran into problems using this trick:

```python
etree._original_serialize_xml = etree._serialize_xml

def _serialize_xml(write, elem, qnames, namespaces):
    if elem.tag == '![CDATA[':
        write("\n<%s%s]]>\n" % (
                elem.tag, elem.text))
        return
    return etree._original_serialize_xml(
        write, elem, qnames, namespaces)

etree._serialize_xml = etree._serialize['xml'] = _serialize_xml
```

The problem with this approach is that, after handling this special case, the serializer goes on to treat the element as a normal tag again. I was getting something like this:

```xml
<textContent>
<![CDATA[this was the code I wanted to put inside of CDATA]]>
<![CDATA[>this was the code I wanted to put inside of CDATA</![CDATA[>
</textContent>
```

Of course, that only causes plenty of errors. Why was it happening?
The answer is in this little guy:

```python
return etree._original_serialize_xml(write, elem, qnames, namespaces)
```

Once we have trapped our CDATA element and handled it successfully, we don't want it to go through the original serialize function again. So the original serialize function should be called only when the element is not CDATA: we were missing an "else" before that return.
Moreover, in my version of the ElementTree module, the serialize function insisted on a `short_empty_elements` argument. So the most recent version I would recommend looks like this (it also handles the tail):

```python
from xml.etree import ElementTree

# To test this, create a testing.xml file in the same folder as the script.
xmlParsedWithET = ElementTree.parse("testing.xml")
root = xmlParsedWithET.getroot()

def CDATA(text=None):
    element = ElementTree.Element('![CDATA[')
    element.text = text
    return element

ElementTree._original_serialize_xml = ElementTree._serialize_xml

def _serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs):
    if elem.tag == '![CDATA[':
        write("\n<{}{}]]>\n".format(elem.tag, elem.text))
        if elem.tail:
            # _escape_cdata is a private helper of the ElementTree module.
            write(ElementTree._escape_cdata(elem.tail))
    else:
        return ElementTree._original_serialize_xml(
            write, elem, qnames, namespaces, short_empty_elements, **kwargs)

ElementTree._serialize_xml = ElementTree._serialize['xml'] = _serialize_xml

text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""

cdata = CDATA(text)
root.append(cdata)

# tests
print(root)
print(list(root)[0])  # list(root) replaces getchildren(), removed in Python 3.9
print(list(root)[0].text + "\n\nyay!")
```

When I ran it, the test prints at the end showed the root element, its first child, and that child's text followed by "yay!".
I wish you the same result!
小智 6
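Actually, this code has a bug, because it doesn't catch `]]>` appearing in the data you insert as CDATA.
In that case you should break the data into two CDATA sections, splitting the `]]>` between them.
Basically: `data = data.replace("]]>", "]]]]><![CDATA[>")`
(not necessarily correct, please verify)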
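The replacement suggested above can be checked by parsing the result back. This is a minimal sketch, assuming Python 3's standard `xml.etree.ElementTree`; the helper name `wrap_cdata` and the sample string are hypothetical, not part of any library.

```python
import xml.etree.ElementTree as ET

def wrap_cdata(data):
    """Wrap data in CDATA, splitting any embedded ']]>' across two sections."""
    # ']]>' would end the section early, so close after ']]' and reopen
    # a new section that starts with '>'.
    return "<![CDATA[" + data.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Illustrative payload that happens to contain the terminator sequence.
sample = "if (a[b[0]]> 1) { run(); }"

xml_text = "<data>" + wrap_cdata(sample) + "</data>"
parsed = ET.fromstring(xml_text)

# The parser joins the adjacent sections back into the original string.
print(parsed.text == sample)
```

Data without `]]>` passes through as a single section, so the helper is safe to apply to every CDATA payload before it is written.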
You can override ElementTree's `_escape_cdata` function:
```python
import xml.etree.ElementTree as ET

# encoding defaults to None so the override works whether or not the
# installed ElementTree passes it.
def _escape_cdata(text, encoding=None):
    try:
        if "&" in text:
            text = text.replace("&", "&amp;")
        # Angle brackets are deliberately left unescaped:
        # if "<" in text:
            # text = text.replace("<", "&lt;")
        # if ">" in text:
            # text = text.replace(">", "&gt;")
        return text
    except TypeError:
        raise TypeError(
            "cannot serialize %r (type %s)" % (text, type(text).__name__)
        )

ET._escape_cdata = _escape_cdata
```
Note that, depending on your library/Python version, you may not need the extra `encoding` parameter.
Now you can write CDATA into `obj.text` like this:
```python
root = ET.Element('root')
body = ET.SubElement(root, 'body')
body.text = '<![CDATA[perform extra angle brackets escape for this text]]>'
print(ET.tostring(root))
```
and get a clean CDATA node:
```xml
<root>
    <body>
        <![CDATA[perform extra angle brackets escape for this text]]>
    </body>
</root>
```
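Because the override above stops escaping angle brackets in all text, a narrower variant may be preferable. The following is only a sketch under assumptions: it relies on the private `ET._escape_cdata` helper being looked up at serialization time (as in CPython's pure-Python serializer), and the name `_escape_cdata_keep_sections` is hypothetical. Text already wrapped in CDATA markers passes through untouched; everything else goes to the original escaping function.

```python
import xml.etree.ElementTree as ET

# Keep a reference to the library's own (private) escaping helper.
_original_escape_cdata = ET._escape_cdata

def _escape_cdata_keep_sections(text, *args):
    # Pass explicit CDATA sections through verbatim.
    if (isinstance(text, str)
            and text.startswith("<![CDATA[")
            and text.endswith("]]>")):
        return text
    # Everything else is escaped exactly as before.
    return _original_escape_cdata(text, *args)

ET._escape_cdata = _escape_cdata_keep_sections

root = ET.Element("root")
body = ET.SubElement(root, "body")
body.text = "<![CDATA[x < y]]>"
note = ET.SubElement(root, "note")
note.text = "a < b"

output = ET.tostring(root, encoding="unicode")
print(output)
```

With this variant, ordinary text containing `<` or `>` is still escaped, so the rest of the document stays well-formed. The payload itself must not contain `]]>`.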