Strip only tags with specific attributes/values using Python and lxml

Bus*_*gue 10 python lxml

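I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way to strip only those tags that have particular attributes/values, while leaving their contents in place.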

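For example: I'd like to strip all span or div tags (or other elements) that have a class='myclass' attribute/value from a tree (XHTML), preserving the elements' contents the way strip_tags does. Meanwhile, the same kinds of elements that do not have class='myclass' should remain untouched.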

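Conversely: I'd like a way to strip all "naked" spans or divs from a tree, meaning only those spans/divs (or any other elements) that have no attributes at all, while leaving the same elements untouched when they have any attribute.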

I feel like I'm missing something obvious, but I've been searching for quite a while without any luck.

Luk*_*raf 12

HTML

lxml's HTML elements have a drop_tag() method, which you can call on any element in a tree parsed by lxml.html.

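It behaves like strip_tags in that it removes the element but keeps its text. Because it is called on an individual element, you can select the elements you want to get rid of with an XPath expression, then loop over them and drop each one: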

doc.html

<html>
    <body>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get <span attr="foo">removed</span> as well.</div>
        <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
        <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
    </body>
</html>

strip.py

from lxml import etree
from lxml import html

doc = html.parse('doc.html')  # parse() accepts a filename, so no file handle is left open
spans_with_attrs = doc.xpath("//span[@attr='foo']")

for span in spans_with_attrs:
    span.drop_tag()

print(etree.tostring(doc, encoding='unicode'))

Output:

<html>
    <body>
        <div>This is some Text.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get removed as well.</div>
        <div>Nested elements will <b>be</b> left alone.</div>
        <div>Unless they also match.</div>
    </body>
</html>

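In this case, the XPath expression //span[@attr='foo'] selects all span elements that have an attribute attr with the value foo. See this XPath tutorial for more details on how to construct XPath expressions.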
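The two cases from the question map onto XPath predicates in the same way. The following is a minimal sketch rather than a definitive implementation: the helper name drop_matching and the sample markup are made up for illustration. It assumes the class attribute holds exactly one class name (a multi-valued class attribute would need a different predicate) and uses not(@*) to match elements that carry no attributes at all.

```python
from lxml import html

# Hypothetical sample markup, not taken from the question.
SAMPLE = ('<div><span class="myclass">styled</span> '
          '<span>naked</span> <span id="keep">kept</span></div>')


def drop_matching(root, xpath_expr):
    """Hypothetical helper: drop the tag of every element matched by xpath_expr."""
    for element in root.xpath(xpath_expr):
        # drop_tag() removes the tag but keeps text, children and tail
        element.drop_tag()
    return root


root = html.fromstring(SAMPLE)

# Case 1: spans whose class attribute is exactly 'myclass'
drop_matching(root, "//span[@class='myclass']")

# Case 2: "naked" spans, i.e. spans without any attribute
drop_matching(root, "//span[not(@*)]")

result = html.tostring(root, encoding='unicode')
print(result)
```

After both calls only the span that carries an id attribute should remain. To cover divs as well, a union such as //span[not(@*)] | //div[not(@*)] can be passed instead.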

XML/XHTML

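Edit: I just noticed that you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag() method is only available for elements in an HTML document.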

So for XML it's a bit more complicated:

doc.xml

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>

strip.py

from lxml import etree


def strip_nodes(nodes):
    for node in nodes:
        text_content = node.xpath('string()')

        # Include tail in full_text because it will be removed with the node
        full_text = text_content + (node.tail or '')

        parent = node.getparent()
        prev = node.getprevious()
        if prev is not None:
            # There is a previous node, append text to its tail
            # (test against None: an lxml element without children is falsy)
            prev.tail = (prev.tail or '') + full_text
        else:
            # It's the first node in <parent/>, append to parent's text
            parent.text = (parent.text or '') + full_text
        parent.remove(node)


doc = etree.parse('doc.xml')  # parse() accepts a filename, so no file handle is left open
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)

print(etree.tostring(doc, encoding='unicode'))

Output:

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this first span should <span>be</span> removed.</node>
</document>

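As you can see, this replaces the node and all of its children with the node's recursive text content, so nested elements such as the <b> above are flattened to plain text. I really hope that's what you want, otherwise things get even more complicated ;-)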
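If the child elements should survive, as they do with strip_tags, one possible approach for plain etree trees is to move the node's text and tail to its neighbours and then splice its children into the parent. The sketch below is an assumption about how that could be done, not part of the lxml API: the function name unwrap is hypothetical, and it refuses to handle the root element because a root has no parent to splice into.

```python
from lxml import etree


def unwrap(node):
    """Hypothetical helper: remove node's tag, keeping its text, children and tail."""
    parent = node.getparent()
    if parent is None:
        raise ValueError("cannot unwrap the root element")
    prev = node.getprevious()

    # The node's leading text goes to whatever precedes the node.
    if node.text:
        if prev is not None:
            prev.tail = (prev.tail or '') + node.text
        else:
            parent.text = (parent.text or '') + node.text

    # The node's tail goes after its last child, or after the preceding content.
    if node.tail:
        if len(node):
            node[-1].tail = (node[-1].tail or '') + node.tail
        elif prev is not None:
            prev.tail = (prev.tail or '') + node.tail
        else:
            parent.text = (parent.text or '') + node.tail

    # Replace the node with its children, in place.
    index = parent.index(node)
    parent[index:index + 1] = node[:]


root = etree.fromstring(
    '<document><node>Only this <span attr="foo">first <b>span</b></span>'
    ' should <span>be</span> removed.</node></document>'
)
for span in root.xpath("//span[@attr='foo']"):
    unwrap(span)

result = etree.tostring(root, encoding='unicode')
print(result)
```

With this variant the <b> element stays in the tree and only the matching span tag disappears. The parent is looked up at the time of each call, so nested matches are handled as well.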

Note: the last edit changed the code in question.