How do I set the text field of an ElementTree Element from its constructor? Or, in the code below, why is the second print of root.text None?
import xml.etree.ElementTree as ET
root = ET.fromstring("<period units='months'>6</period>")
ET.dump(root)
print root.text
root=ET.Element('period', {'units': 'months'}, text='6')
ET.dump(root)
print root.text
root=ET.Element('period', {'units': 'months'})
root.text = '6'
ET.dump(root)
print root.text
This outputs:
<period units="months">6</period>
6
<period text="6" units="months" />
None
<period units="months">6</period>
6
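The second result follows from how the constructor treats keyword arguments: Element(tag, attrib, **extra) merges every extra keyword into the attribute dictionary, so text='6' becomes an attribute called "text" and root.text is never set. A minimal sketch in Python 3 syntax; make_element is a hypothetical helper, not part of ElementTree:

```python
import xml.etree.ElementTree as ET

# Keyword arguments are merged into the attributes, so .text stays None.
wrong = ET.Element('period', {'units': 'months'}, text='6')
print(wrong.attrib)
print(wrong.text)


def make_element(tag, text=None, **attrib):
    """Hypothetical helper: create an element and set its text in one call."""
    elem = ET.Element(tag, attrib)
    elem.text = text
    return elem


root = make_element('period', text='6', units='months')
print(ET.tostring(root, encoding='unicode'))
```

A side effect of this design is that the helper cannot create an attribute that is itself named "text"; pass an explicit dict to ET.Element if that is ever needed.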
I'm new to XML parsing and Python, so please bear with me. I'm using lxml to parse a wiki dump, but I only want each page, its title, and its text.
Right now I have this:
from xml.etree import ElementTree as etree
def parser(file_name):
    document = etree.parse(file_name)
    titles = document.findall('.//title')
    print titles
At the moment, titles doesn't return anything. I've already looked at earlier answers such as "ElementTree findall() returns empty list" and at the lxml documentation, but most of it seems to be tailored to parsing HTML.
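An empty list from findall() is what typically happens when the document declares a default namespace, as the xmlns attribute on the mediawiki root element of this dump does: every tag is then really {namespace-uri}title. A minimal sketch in Python 3, assuming a cut-down sample document; the page/title layout and the "mw" prefix are assumptions made for illustration:

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the dump; the <page>/<title> layout is assumed.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page><title>Example page</title><revision><text>Body</text></revision></page>
</mediawiki>"""

# The prefix is arbitrary; only the URI has to match the xmlns value.
NS = {'mw': 'http://www.mediawiki.org/xml/export-0.7/'}

root = ET.fromstring(SAMPLE)
no_ns = root.findall('.//title')           # unqualified name: no match
with_ns = root.findall('.//mw:title', NS)  # namespace-qualified name
titles = [t.text for t in with_ns]
print(no_ns, titles)
```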
Here is part of my XML:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.20wmf9</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">Wikipedia</namespace>
<namespace key="5" case="first-letter">Wikipedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" …
I need help understanding why parsing my XML file* with xml.etree.ElementTree produces the following errors.
*My test XML file contains Arabic characters.
Task:
Open and parse the utf8_file.xml file.
My first attempt:
import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)
Result 1:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)
My second attempt:
import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree = etree.fromstring(xml_string)
Result 2:
AttributeError: 'file' object has no attribute 'getiterator'
Please explain the errors above and comment on possible solutions.
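A likely reading of the two errors: in the first attempt the parser receives text that codecs.open has already decoded, and Python 2 then tries to re-encode it as ASCII; in the second, tostring() serialises an Element, so handing it a file object fails. The usual way around both is to let the parser do the decoding by passing a filename or a file opened in binary mode. A minimal sketch in Python 3; the temporary file is a stand-in for utf8_file.xml and its content is invented for the example:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Create a small UTF-8 file containing Arabic text (stand-in for utf8_file.xml).
content = '<?xml version="1.0" encoding="utf-8"?><root><word>\u0633\u0644\u0627\u0645</word></root>'
path = os.path.join(tempfile.mkdtemp(), 'utf8_file.xml')
with open(path, 'wb') as f:
    f.write(content.encode('utf-8'))

# Option 1: pass the filename and let the parser open and decode the file.
tree = ET.parse(path)

# Option 2: pass a file object opened in binary mode.
with open(path, 'rb') as f:
    tree2 = ET.parse(f)

word = tree.getroot().find('word').text
print(word == tree2.getroot().find('word').text)
```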
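In Python 2.6 using ElementTree, what's a good way to get the XML (as a string) inside a particular element, like what you can do with innerHTML in HTML and JavaScript?
Here is a simplified example of the XML node I'm starting with: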
<label attr="foo" attr2="bar">This is some text <a href="foo.htm">and a link</a> in embedded HTML</label>
I'd like to end up with this string:
This is some text <a href="foo.htm">and a link</a> in embedded HTML
I've tried iterating over the parent node and concatenating the tostring() of each child, but that gives me only the subnodes:
# returns only subnodes (e.g. <a href="foo.htm">and a link</a>)
''.join([et.tostring(sub, encoding="utf-8") for sub in node])
I could hack together a solution with regular expressions, but I was hoping for something less hacky:
re.sub("</\w+?>\s*?$", "", re.sub("^\s*?<\w*?>", "", et.tostring(node, encoding="utf-8")))
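One way to avoid the regular expressions is to start from the element's own .text and then append each serialised child; in xml.etree, tostring() on a child also emits that child's tail text, so the pieces line up. A minimal sketch in Python 3 (encoding='unicode' is not available in Python 2.6); inner_xml is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def inner_xml(node):
    """Hypothetical helper: serialise everything inside node, innerHTML-style."""
    # Leading text is stored on the node itself and must be re-escaped.
    parts = [escape(node.text or '')]
    for child in node:
        # tostring() includes the child's tail text after its closing tag.
        parts.append(ET.tostring(child, encoding='unicode'))
    return ''.join(parts)


node = ET.fromstring(
    '<label attr="foo" attr2="bar">This is some text '
    '<a href="foo.htm">and a link</a> in embedded HTML</label>'
)
print(inner_xml(node))
```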
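Using lxml's implementation of the ElementTree API, it's easy to completely remove a given element from an XML document, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input:
data = '''<everything>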
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
...you can easily remove every <r> element:
from lxml import etree
f = etree.fromstring(data)
for r in f.xpath('//r'):
    r.getparent().remove(r)
print etree.tostring(f, pretty_print=True)
But how would you go about replacing each element with text, to get this output:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/>Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
It seems to me that this is because the ElementTree API deals with text via .text and …
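Because text lives in .text and .tail rather than in nodes of its own, the replacement string has to be attached either to the previous sibling's tail or, when there is no previous sibling, to the parent's text. A minimal sketch of that rule in Python 3, written against the standard-library xml.etree instead of lxml so that it has no third-party dependency; replace_with_text is a hypothetical helper. With lxml the same idea can be expressed through getparent() and getprevious().

```python
import xml.etree.ElementTree as ET


def replace_with_text(root, tag, replacement):
    """Hypothetical helper: swap every <tag> element for plain text."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag != tag:
                previous = child
                continue
            # The removed element's tail must survive after the replacement.
            text = replacement + (child.tail or '')
            if previous is None:
                parent.text = (parent.text or '') + text
            else:
                previous.tail = (previous.tail or '') + text
            parent.remove(child)


data = (
    '<everything>'
    '<m>Some text before <r/></m>'
    '<m><r/> and some text after.</m>'
    '<m><r/></m>'
    '<m>Text before <r/> and after</m>'
    '<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>'
    '</everything>'
)
root = ET.fromstring(data)
replace_with_text(root, 'r', 'DELETED')
print(ET.tostring(root, encoding='unicode'))
```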
Given XML like the following:
<root>
<element>A</element>
<element>B</element>
</root>
Using ElementTree and its XPath support, how can I match the element whose content is A? Thanks
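ElementTree's limited XPath subset includes a text predicate of the form [.='text'], available from Python 3.7 onwards. A minimal sketch, with a list comprehension as a fallback for older versions:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root><element>A</element><element>B</element></root>')

# [.='A'] compares the element's complete text content (Python 3.7+).
matches = root.findall("./element[.='A']")

# Fallback that does not rely on the text predicate.
fallback = [e for e in root.iter('element') if e.text == 'A']

print(len(matches), matches[0].text)
```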
I'm parsing XML in Python using ElementTree
import xml.etree.ElementTree as ET
tree = ET.parse('try.xml')
root = tree.getroot()
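I want to parse all the 'xml' files in a given directory. The user should only have to enter the directory name, and I should be able to loop through all the files in that directory and parse them one by one. Can someone tell me how to approach this? I'm using Linux.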
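One straightforward approach is to glob for *.xml inside the directory and parse each match in turn. A minimal sketch in Python 3; parse_directory is a hypothetical helper, and the temporary directory with its files exists only to make the example runnable:

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_directory(directory):
    """Hypothetical helper: map each *.xml path in directory to its root element."""
    roots = {}
    for path in sorted(glob.glob(os.path.join(directory, '*.xml'))):
        try:
            roots[path] = ET.parse(path).getroot()
        except ET.ParseError as exc:
            # Skip malformed files instead of aborting the whole run.
            print('skipping %s: %s' % (path, exc))
    return roots


# Demo data: two XML files and one unrelated file.
demo_dir = tempfile.mkdtemp()
for name, body in [('a.xml', '<root>1</root>'),
                   ('b.xml', '<root>2</root>'),
                   ('notes.txt', 'not xml')]:
    with open(os.path.join(demo_dir, name), 'w') as f:
        f.write(body)

roots = parse_directory(demo_dir)
print(sorted(os.path.basename(p) for p in roots))
```

In a real script the directory name would come from the user, for example from sys.argv[1] or input(), rather than from a temporary directory.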
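When manipulating XML, I want to preserve comments as faithfully as possible.
I've managed to preserve the comments, but their content is getting XML-escaped.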
#!/usr/bin/env python
# add_host_to_tomcat.py
import xml.etree.ElementTree as ET
from CommentedTreeBuilder import CommentedTreeBuilder
parser = CommentedTreeBuilder()
if __name__ == '__main__':
    filename = "/opt/lucee/tomcat/conf/server.xml"
    # this is the important part: use the comment-preserving parser
    tree = ET.parse(filename, parser)
    # get the node to add a child to
    engine_node = tree.find("./Service/Engine")
    # add a node: Engine.Host
    host_node = ET.SubElement(
        engine_node,
        "Host",
        name="local.mysite.com",
        appBase="webapps"
    )
    # add a child to new node: Engine.Host.Context
    ET.SubElement(
        host_node,
        'Context',
        path="",
        docBase="/path/to/doc/base"
    )
    tree.write('out.xml')
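On Python 3.8 and later, the standard library can keep comments without a custom builder: TreeBuilder accepts insert_comments=True, and comment text is then written back out unescaped. This is a different technique from the CommentedTreeBuilder module used above, whose source is not shown here. A minimal sketch; the sample document is invented for the example:

```python
import xml.etree.ElementTree as ET

# Invented sample; the comment deliberately contains "<" and "&".
SAMPLE = '<Server><!-- keep <this> & that --><Service><Engine/></Service></Server>'

# insert_comments=True (Python 3.8+) keeps comments as nodes in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SAMPLE, parser=parser)

engine = root.find('./Service/Engine')
ET.SubElement(engine, 'Host', name='local.mysite.com', appBase='webapps')

out = ET.tostring(root, encoding='unicode')
print(out)
```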
#!/usr/bin/env python …
The following test reads a file and uses lxml.html to generate the leaf nodes of the DOM/graph for the page.
However, I'm also trying to figure out how to take the input from a "string". Using
lxml.html.fromstring(s)
doesn't work, because this produces an "Element" rather than an "ElementTree".
So I'm trying to figure out how to convert the Element into an ElementTree.
Thoughts?
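In the ElementTree API an Element can be wrapped in a tree object by passing it to the ElementTree constructor. A minimal sketch in Python 3 using the standard-library xml.etree rather than lxml.html, so it runs without third-party packages; the sample markup is invented. lxml elements additionally offer getroottree(), which returns the tree an element belongs to.

```python
import xml.etree.ElementTree as ET

# Invented sample markup standing in for the parsed string.
element = ET.fromstring('<html><body><p>one</p><p>two</p></body></html>')

# Wrap the Element so that tree-level methods become available.
tree = ET.ElementTree(element)

# Leaf nodes are the elements without children.
leaves = [node.tag for node in tree.iter() if len(node) == 0]
print(leaves)
```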
import subprocess
import lxml.html
from lxml import etree  # trying this to see if needed
# to convert from element to elementtree
#cmd='cat osu_test.txt'
cmd='cat o2.txt'
proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
s=proc.communicate()[0].strip()
# s contains HTML not XML text
#doc = lxml.html.parse(s)
doc = lxml.html.parse('osu_test.txt')
doc1 = lxml.html.fromstring(s)
for node in doc.iter():
    if len(node) == 0:
        print "aaa ",node.tag, doc.getpath(node)
        #print "aaa ",node.tag
nt = etree.ElementTree(doc1)  # <<<<< doesn't work.. so what will??
for node in nt.iter():
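    if len(node) …
I've been converting some of my original xml.etree.ElementTree (ET) code to lxml.etree (lxmlET). Fortunately, there are a lot of similarities between the two. However, I stumbled upon some strange behaviour that I can't find described in any documentation. It concerns the internal representation of descendant nodes.
In ET, iter() is used to iterate over all descendants of an Element, optionally filtered by tag name. Because I couldn't find any details about this in the documentation, I expected similar behaviour from lxmlET. The problem is that, from testing, I conclude that lxmlET uses a different internal representation of the tree.
In the example below, I iterate over the nodes in a tree and print each node's children, but in addition I also create all the different combinations of those children and print them. This means that if an element has children ('A', 'B', 'C'), I create alterations, namely the trees [('A'), ('A', 'B'), ('A', 'C'), ('B'), ('B', 'C'), ('C')].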
# import lxml.etree as ET
import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy
def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            new_combo_tree = ET.Element(tree.tag, tree.attrib)
            for recombined_child in combination:
                new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make …
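The deepcopy mentioned in the comment points at the probable cause: in xml.etree, append() stores a reference, so one element can sit under several parents at once, whereas an lxml element has a single parent and appending it elsewhere moves it out of the original tree. A minimal sketch of the xml.etree side in Python 3, showing what sharing means and how deepcopy gives an independent node; the element names are invented:

```python
import xml.etree.ElementTree as ET
from copy import deepcopy

tree = ET.fromstring('<root><A/><B/></root>')
child = tree[0]

# xml.etree: the same object is now referenced by both parents.
shared = ET.Element('combo')
shared.append(child)

# deepcopy produces a node that no longer tracks the original.
copied = ET.Element('combo')
copied.append(deepcopy(child))

child.set('touched', 'yes')
print(len(tree), shared[0].get('touched'), copied[0].get('touched'))
```

Copying each child before appending it gives the same result under both libraries, at the cost of duplicating the subtrees.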