How do I remove the class attribute from HTML using Python and lxml?
I have:
<p class="DumbClass">Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>
I want:
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>
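I've looked at lxml.html.clean.Cleaner, but it has no method for stripping out class attributes. You can set safe_attrs_only=True, but that does not remove the class attribute.
Extensive searching has turned up nothing workable. I think the fact that class is used in both HTML and Python muddies the search results further. Many of the results also seem to deal strictly with XML.
I'm open to other Python modules that offer a human-friendly interface.
Thanks very much.
Thanks to @Dan Roberts's answer below, I came up with the following solution. I'm posting it for anyone who arrives here in the future trying to solve the same problem.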
import lxml.html

# The HTML string we want to remove the class attribute from
html_string = '<p class="DumbClass">Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>'

# Parse the HTML
html = lxml.html.fromstring(html_string)

# Print out our "before"
print(lxml.html.tostring(html))

# .xpath below gives us a list of all elements that have a class attribute
# XPath syntax explained:
# // = select all matching nodes regardless of location in the document
# * = match any tag
# [@class] = keep only elements that have a class attribute
for tag in html.xpath('//*[@class]'):
    # For each element with a class attribute, remove that class attribute
    tag.attrib.pop('class')

# Print out our "after"
print(lxml.html.tostring(html))
Dan Roberts 15
I can't test this right now, but this seems to be the general idea:
for tag in node.xpath('//*[@class]'):
    tag.attrib.pop('class')