Auf*_*ind 1 python tags lxml strip
How can I tell etree.strip_tags() to strip every possible tag from a given element?
Do I have to list them all myself, for example:
STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
# that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)
Or is there a more elegant approach that I am not aware of?
Example input:
parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
Desired output:
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
Or even better:
This is some text with multiple tags and sometimes they are nested.
You can use the lxml.html.clean module:
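# Note: in recent lxml releases (5.2 and later) lxml.html.clean is provided
# by the separate lxml_html_clean package, which must be installed as well.
import lxml.html, lxml.html.clean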
s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)
print(lxml.etree.tostring(cleaned_tree, encoding='unicode'))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
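As an alternative to the Cleaner approach, a minimal sketch using only lxml.etree is shown here. It relies on strip_tags() accepting the same wildcards as Element.iter(), so passing "*" should match every descendant element without a hand-written tag list. The helper names strip_all_tags and text_only are hypothetical, chosen for illustration, and the sketch assumes the input is well-formed XML rather than arbitrary HTML.

```python
from lxml import etree

SOURCE = (
    "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> "
    "and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
)


def strip_all_tags(xml_text):
    """Serialize the root element with every descendant tag stripped."""
    root = etree.fromstring(xml_text)
    # "*" is a wildcard for any element name. strip_tags() only treats
    # descendants, so the element passed in (the root) is kept.
    etree.strip_tags(root, "*")
    return etree.tostring(root, encoding="unicode")


def text_only(xml_text):
    """Return just the text content, without the enclosing root tag."""
    root = etree.fromstring(xml_text)
    # itertext() yields the text and tail of every node in document order.
    return "".join(root.itertext())


print(strip_all_tags(SOURCE))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
print(text_only(SOURCE))
# This is some text with multiple tags and sometimes they are nested.
```

The first helper produces the "desired output" from the question, and the second produces the "even better" plain-text form. Unlike lxml.html.fromstring(), etree.fromstring() raises an error on malformed markup, so the Cleaner approach may remain the better fit for real-world HTML.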