List of distinct XML element names using BeautifulSoup

Question

ber*_*nie 2 python xml tags beautifulsoup

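I am using BeautifulSoup to parse an XML document. Is there a straightforward way to get a list of the distinct element names used in the document?

For example, if this is the document: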

<?xml version="1.0" encoding="UTF-8"?>
<note>
    <to> Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

I would like to get: note, to, from, heading, body

Answer 1

ale*_*cxe 5

You can use find_all() and read the .name of each tag it finds:

from bs4 import BeautifulSoup

data = """<?xml version="1.0" encoding="UTF-8"?>
<note>
    <to> Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
"""

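# The 'xml' feature requires lxml to be installed
soup = BeautifulSoup(data, 'xml')
# find_all() with no arguments matches every tag in the document
print([tag.name for tag in soup.find_all()])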

Prints:

['note', 'to', 'from', 'heading', 'body']

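Note that for this to work you need the lxml module installed, because, according to the documentation:

Right now, the only supported XML parser is lxml. If you don't have lxml installed, asking for an XML parser won't give you one, and asking for "lxml" won't work either.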
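To make the lxml requirement concrete, here is a minimal sketch of a wrapper that turns the missing-parser failure into a clearer message. The helper name parse_xml is hypothetical; the sketch assumes a bs4 version that raises FeatureNotFound when no XML-capable parser is available, which may differ in older releases.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_xml(markup):
    """Parse markup with BeautifulSoup's XML mode (hypothetical helper)."""
    try:
        # The 'xml' feature is provided by lxml
        return BeautifulSoup(markup, "xml")
    except FeatureNotFound as exc:
        # Assumption: bs4 raises FeatureNotFound when lxml is not installed
        raise RuntimeError(
            "BeautifulSoup needs lxml to parse XML; try 'pip install lxml'"
        ) from exc


soup = parse_xml("<note><to>Tove</to></note>")
print(soup.find("to").name)
```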

And, to follow up on that, why not use a dedicated XML parser directly?

Example, using lxml:

from lxml import etree

data = """<?xml version="1.0" encoding="UTF-8"?>
<note>
    <to> Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
"""

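# lxml rejects str input that carries an encoding declaration, so pass bytes
tree = etree.fromstring(data.encode('utf-8'))
# '//*' selects every element in the document
print([item.tag for item in tree.xpath('//*')])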

Prints:

['note', 'to', 'from', 'heading', 'body']

And, to follow up on that in turn, why use a third-party library for such a simple task?

Example, using xml.etree.ElementTree from the standard library:

from xml.etree.ElementTree import fromstring, ElementTree

data = """<?xml version="1.0" encoding="UTF-8"?>
<note>
    <to> Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
"""

tree = ElementTree(fromstring(data))
# iter() replaces getiterator(), which was removed in Python 3.9
print([item.tag for item in tree.iter()])

Prints:

['note', 'to', 'from', 'heading', 'body']
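Each approach above yields one name per element, so a document with repeated tags would list the same name several times. Since the question asks for distinct names, here is a small sketch that removes repeats while keeping the order of first appearance. The function name distinct_tag_names and the sample document are made up for illustration; the same dict.fromkeys() step should work on the lists produced by the BeautifulSoup and lxml versions too.

```python
from xml.etree.ElementTree import fromstring


def distinct_tag_names(xml_text):
    """Return each element name once, in order of first appearance."""
    root = fromstring(xml_text)
    # root.iter() walks the root and all descendants in document order;
    # dict.fromkeys() drops repeats while preserving that order.
    return list(dict.fromkeys(element.tag for element in root.iter()))


# Hypothetical sample with repeated <note> and <to> elements
sample = "<notes><note><to>Tove</to></note><note><to>Jani</to></note></notes>"
print(distinct_tag_names(sample))  # ['notes', 'note', 'to']
```

If order does not matter, sorted(set(...)) over the same generator is an equally reasonable choice.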
