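I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way to strip tags (leaving their contents) only when they carry a specific attribute/value.
For example: I'd like to strip all span or div tags (or other elements) from a tree (XHTML) that have a class='myclass' attribute/value, preserving the elements' contents the way strip_tags would. Meanwhile, those same elements that don't have class='myclass' should remain untouched.
Conversely: I'd like a way to strip all "naked" spans or divs from a tree, meaning only those spans/divs (or any other elements, for that matter) that have absolutely no attributes, while leaving those same elements untouched if they have any attributes at all.
I feel like I'm missing something obvious, but I've been searching for quite some time without any luck.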
Luk*_*raf 12
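lxml's HTML elements have a drop_tag() method that you can call on any element in a tree parsed by lxml.html.
It acts like strip_tags in that it removes the element but keeps its text, and it is called on an individual element. That means you can select the elements you don't want with an XPath expression, then loop over them and drop them: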
doc.html
<html>
<body>
<div>This is some <span attr="foo">Text</span>.</div>
<div>Some <span>more</span> text.</div>
<div>Yet another line <span attr="bar">of</span> text.</div>
<div>This span will get <span attr="foo">removed</span> as well.</div>
<div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
<div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
</body>
</html>
strip.py
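from lxml import etree
from lxml import html

# Parse with lxml.html so the elements provide drop_tag()
doc = html.parse('doc.html')
spans_with_attrs = doc.xpath("//span[@attr='foo']")
for span in spans_with_attrs:
    # Remove the tag itself but keep its text and child elements
    span.drop_tag()

print(etree.tostring(doc, encoding='unicode'))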
Output:
<html>
<body>
<div>This is some Text.</div>
<div>Some <span>more</span> text.</div>
<div>Yet another line <span attr="bar">of</span> text.</div>
<div>This span will get removed as well.</div>
<div>Nested elements will <b>be</b> left alone.</div>
<div>Unless they also match.</div>
</body>
</html>
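In this case, the XPath expression //span[@attr='foo'] selects all span elements that have an attribute attr with the value foo. For more details on how to construct XPath expressions, refer to an XPath tutorial.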
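Applied to the two cases in the question, the same pattern might look like the sketch below. It assumes the markup is parsed with lxml.html so that drop_tag() is available; the sample markup, the helper name drop_matching, and the id/class values are made up for illustration. Note that [@class='myclass'] only matches an exact attribute value, and not(@*) selects elements that have no attributes at all.

```python
from lxml import html

# Hypothetical sample markup, not taken from the question.
SOURCE = (
    '<div id="root">'
    '<div>A <span>naked</span> span</div>'
    '<p>and a <span id="x">dressed</span> one, '
    'plus <span class="myclass">classed</span>.</p>'
    '</div>'
)


def drop_matching(root, xpath_expr):
    """Unwrap every element matched by xpath_expr, keeping its contents."""
    # xpath() returns a list, so the tree can be modified while looping.
    for element in root.xpath(xpath_expr):
        element.drop_tag()
    return root


# Case 1: strip only the spans/divs that carry class="myclass".
root = html.fromstring(SOURCE)
drop_matching(root, "//span[@class='myclass'] | //div[@class='myclass']")
by_class = html.tostring(root, encoding="unicode")

# Case 2: strip only the "naked" spans/divs (no attributes at all).
root = html.fromstring(SOURCE)
drop_matching(root, "//span[not(@*)] | //div[not(@*)]")
naked = html.tostring(root, encoding="unicode")

print(by_class)
print(naked)
```

drop_tag() needs a parent element to move the contents into, so the expression should not match the topmost element of the tree; that is why the outer div in the sample carries an id.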
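Edit: I just noticed that you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag() method is only available for elements in an HTML document.
So for XML it's a bit more complicated: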
doc.xml
<document>
<node>This is <span>some</span> text.</node>
<node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>
strip.py
from lxml import etree

def strip_nodes(nodes):
    for node in nodes:
        text_content = node.xpath('string()')
        # Include tail in full_text because it will be removed with the node
        full_text = text_content + (node.tail or '')
        parent = node.getparent()
        prev = node.getprevious()
        # Compare against None: an element without children is falsy
        if prev is not None:
            # There is a previous node, append text to its tail
            prev.tail = (prev.tail or '') + full_text
        else:
            # It's the first node in <parent/>, append to parent's text
            parent.text = (parent.text or '') + full_text
        parent.remove(node)

doc = etree.parse('doc.xml')
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)
print(etree.tostring(doc, encoding='unicode'))
Output:
<document>
<node>This is <span>some</span> text.</node>
<node>Only this first span should <span>be</span> removed.</node>
</document>
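As you can see, this replaces the node and all of its children with their recursive text content. I really hope that's what you want, otherwise things get even more complicated ;-)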
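If the child elements should survive instead of being flattened to text, one possible approach for plain XML trees is to rename the matched elements to a placeholder tag and let etree.strip_tags() unwrap them. This is only a sketch and not part of the code above: the helper unwrap_matching and the placeholder name STRIP_ME are hypothetical, and it assumes no real element in the document already uses that tag name.

```python
from lxml import etree

# Same second <node> as in doc.xml above, inlined to keep the sketch
# self-contained.
XML = (
    '<document>'
    '<node>Only this <span attr="foo">first <b>span</b></span> '
    'should <span>be</span> removed.</node>'
    '</document>'
)


def unwrap_matching(tree, xpath_expr, sentinel="STRIP_ME"):
    """Remove the matched tags but keep their text and child elements."""
    # Mark every match by renaming it to the placeholder tag ...
    for node in tree.xpath(xpath_expr):
        node.tag = sentinel
    # ... then let strip_tags() merge text, children and tail into the parent.
    etree.strip_tags(tree, sentinel)


root = etree.fromstring(XML)
unwrap_matching(root, "//span[@attr='foo']")
result = etree.tostring(root, encoding="unicode")
print(result)
```

Here the nested b element is kept, unlike with strip_nodes(). strip_tags() only treats descendants of the element it is given, so a match on the root element itself would be left in place (and renamed by this helper).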
Note: the last edit changed the code in question.