som*_*off · python, parsing, lxml, html-parsing
I'm using lxml to parse HTML and edit it to produce a new document. Essentially, I'm trying to use it somewhat like the JavaScript DOM - I know that's not really the intended use, but so far much of it works well.
Currently, I use iterdescendants() to get an iterable of elements and then process each one in turn.
However, if an element is removed during iteration, its children are still considered, since the removal doesn't affect the iteration the way you'd expect. To get the result I want, this hack works:
from lxml.html import fromstring, tostring
import re
html = '''
<html>
<head>
</head>
<body>
<div>
<p class="unwanted">This content should go</p>
<p class="fine">This content should stay</p>
</div>
<div id="second" class="unwanted">
<p class="alreadydead">This content should not be looked at</p>
<p class="alreadydead">Nor should this</p>
<div class="alreadydead">
<p class="alreadydead">Still dead</p>
</div>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
</html>
'''
page = fromstring(html)
allElements = page.iterdescendants()

for element in allElements:
    s = "%s%s" % (element.get('class', ''), element.get('id', ''))
    if re.search('unwanted', s):
        # Advance the iterator past the doomed element's descendants
        # so they are not visited after the subtree is dropped
        for _ in range(len(element.findall('.//*'))):
            next(allElements)
        element.drop_tree()

print(tostring(page.body))
This outputs:
<body>
<div>
<p class="fine">This content should stay</p>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
This feels like a nasty hack, though. Is there a more sensible way to achieve this with the library?
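If, as in the sample document, the unwanted nodes are always flagged through their class attribute, lxml.html's find_class() helper is another option (a sketch, not from the original thread; note that it matches class names only, not ids). It returns a plain list rather than a live iterator, so dropping elements inside the loop is safe:

```python
from lxml.html import fromstring, tostring

# Trimmed-down version of the sample document from the question
html = '''<html><body>
<div>
<p class="unwanted">This content should go</p>
<p class="fine">This content should stay</p>
</div>
<div id="second" class="unwanted"><p>This content should not be looked at</p></div>
</body></html>'''

page = fromstring(html)
# find_class() matches whole, space-separated class names and returns a list,
# so removing elements during the loop cannot confuse the iteration
for element in page.find_class('unwanted'):
    element.drop_tree()

print(tostring(page.body, encoding='unicode'))
```

Because find_class() only inspects the class attribute, a node marked solely by id="unwanted" would still need an XPath-based approach.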
To simplify matters, you can use lxml's support for regular expressions in XPath (via the EXSLT extensions) to find and kill the unwanted nodes without iterating over all the descendants yourself.
This produces the same result as your script:
import lxml.html

EXSLT_NS = 'http://exslt.org/regular-expressions'
XPATH = r"//*[re:test(@class, '\bunwanted\b') or re:test(@id, '\bunwanted\b')]"

tree = lxml.html.fromstring(html)
for node in tree.xpath(XPATH, namespaces={'re': EXSLT_NS}):
    node.drop_tree()
print(lxml.html.tostring(tree.body))
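A side note on drop_tree() (my addition, not part of the original answer): unlike the ElementTree-style parent.remove(child), which discards the element together with its tail text, drop_tree() keeps the tail. This matters when you delete inline elements from running text:

```python
from lxml.html import fragment_fromstring, tostring

# drop_tree() removes the element and its children but keeps the tail text,
# merging it into the surrounding content
div = fragment_fromstring('<div><span class="unwanted">gone</span> but this tail stays</div>')
div.find('span').drop_tree()
print(tostring(div, encoding='unicode'))

# plain remove() discards the tail text along with the element
div2 = fragment_fromstring('<div><span class="unwanted">gone</span> and this tail goes too</div>')
div2.remove(div2.find('span'))
print(tostring(div2, encoding='unicode'))
```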