Extracting multiple XML trees from a mixed-content document

Mau*_*ver 7 python xml parsing

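I am trying to use Python to extract multiple XML elements from a mixed-content document. The use case is an email that contains ordinary email text along with several XML trees.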

Here is a sample document:

Email text email text email text email text.

email signature email signature.

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
</catalog>

Email text email text email text email text.

email signature email signature.

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
</catalog>

Email text email text email text email text.

email signature email signature.

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
</catalog>

Email text email text email text email text.

email signature email signature.

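I want to extract the XML trees so that an XML parser can parse them in a for loop. The parsing itself already works: if I take one of the XML trees and parse it directly, it works like a charm.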

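Any suggestions on how to extract the XML trees? The example is also oversimplified: the email text and signature differ in each of my real documents, so the only reliable markers are the start and end of each XML tree.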

sto*_*vfl 4

Question: I want to extract the XML trees so that an XML parser can parse them

Do you really want multiple XML trees?
I would suggest building a single XML tree that contains multiple <book> child elements.
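
As a rough illustration of that suggestion, here is a minimal sketch that merges the <book> children of several already-extracted <catalog> documents into one tree. The helper name `merge_catalogs` and the two inline sample strings are hypothetical and not part of the original question; the sketch assumes each extracted string is a well-formed document whose root has <book> children.

```python
import xml.etree.ElementTree as ET


def merge_catalogs(xml_documents):
    """Combine the <book> children of several <catalog> documents into one tree."""
    merged = ET.Element("catalog")
    for doc in xml_documents:
        root = ET.fromstring(doc)
        # Move every direct <book> child under the shared root.
        merged.extend(root.findall("book"))
    return merged


# Hypothetical, shortened stand-ins for the extracted XML strings.
docs = [
    '<?xml version="1.0"?><catalog><book id="bk101"/></catalog>',
    '<?xml version="1.0"?><catalog><book id="bk102"/></catalog>',
]
merged_catalog = merge_catalogs(docs)
print(ET.tostring(merged_catalog, encoding="unicode"))
```

With a single tree you only need to run the parser once, at the cost of losing which email section each <book> came from.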

Nevertheless, here is what you asked for:

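xml_tag = "<?xml"
catalog_end_tag = "</catalog>"

xml_tree = []   # every line that belongs to an XML block, in one flat list
_xml = False    # True while we are inside an XML block
with open('test/Mixed_email_xml') as fh:
    for line in fh:
        # The XML declaration marks the start of a block.
        if xml_tag in line:
            _xml = True

        if _xml:
            xml_tree.append(line)

        # The closing root tag marks the end of the block.
        if catalog_end_tag in line:
            _xml = False

for line in xml_tree:
    # Strip only the line ending, so a final line without one stays intact.
    print(line.rstrip('\n'))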

Tested with Python: 3.4.2
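
The loop above collects the lines of all XML blocks into one flat list. To parse each tree separately in a for loop, as the question asks, the same start/end markers can be used to group the lines per document. The following is a minimal sketch, not a definitive implementation: the function name `extract_xml_documents` and the shortened `sample` text are made up for illustration, and it assumes every tree starts with an `<?xml` declaration and ends with `</catalog>`, as in the sample document.

```python
import xml.etree.ElementTree as ET

XML_START = "<?xml"
XML_END = "</catalog>"


def extract_xml_documents(text):
    """Return a list of strings, one per XML document found in text."""
    documents = []
    current = None  # lines of the document being collected, or None outside XML
    for line in text.splitlines(keepends=True):
        if current is None and XML_START in line:
            current = []
            # Drop any email text that precedes the declaration on this line.
            line = line[line.index(XML_START):]
        if current is not None:
            if XML_END in line:
                # Cut off anything after the closing tag and finish the document.
                end = line.index(XML_END) + len(XML_END)
                current.append(line[:end])
                documents.append("".join(current))
                current = None
            else:
                current.append(line)
    return documents


# Hypothetical, shortened version of the sample email.
sample = """Email text email text.

<?xml version="1.0"?>
<catalog>
   <book id="bk101"><title>XML Developer's Guide</title></book>
</catalog>

email signature.

<?xml version="1.0"?>
<catalog>
   <book id="bk102"><title>Another title</title></book>
</catalog>
"""

for doc in extract_xml_documents(sample):
    root = ET.fromstring(doc)
    print(root.tag, [book.get("id") for book in root.iter("book")])
```

If the root element is not always <catalog>, the end marker would have to be derived some other way, for example from the first tag that follows the declaration.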