How to strip all child tags in an XML element but keep their text merged into the parent, using lxml in Python?

Auf*_*ind 1 python tags lxml strip

How can I tell etree.strip_tags() to strip all possible tags from a given element?

Do I have to list them all myself, for example:

STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
                           # that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)

Or is there perhaps a more elegant approach that I'm not aware of?

Example input:

parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"

Desired output:

# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

Or even better:

This is some text with multiple tags and sometimes they are nested.

ars*_*ars 5

You can use the lxml.html.clean module:

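# Note: in recent lxml releases the cleaner ships separately as the
# lxml_html_clean package, which may need to be installed for this import.
import lxml.html, lxml.html.clean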


s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)

# Python 3 print(); encoding="unicode" returns a str instead of bytes
print(lxml.etree.tostring(cleaned_tree, encoding="unicode"))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
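If you would rather stay within lxml.etree, as the question does, a possible alternative is to pass the wildcard "*" to etree.strip_tags() instead of a hand-built list of tag names. The tag names accept the same wildcards as Element.iter(), and the element you pass in is not removed even though it matches. For the text-only result, itertext() is enough. This is a minimal sketch under those assumptions; strip_child_tags and plain_text are hypothetical helper names introduced only for illustration, not part of lxml:

```python
from lxml import etree

SAMPLE = (
    "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> "
    "and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
)


def strip_child_tags(xml_text):
    """Hypothetical helper: drop every descendant tag but keep its text."""
    root = etree.fromstring(xml_text)
    # "*" matches any element; the root passed in is kept, and the text and
    # tail of each stripped element are merged into the parent.
    etree.strip_tags(root, "*")
    return etree.tostring(root, encoding="unicode")


def plain_text(xml_text):
    """Hypothetical helper: return only the text, without the parent tag."""
    root = etree.fromstring(xml_text)
    # itertext() yields all text and tail strings in document order.
    return "".join(root.itertext())


print(strip_child_tags(SAMPLE))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
print(plain_text(SAMPLE))
# This is some text with multiple tags and sometimes they are nested.
```

Unlike the Cleaner approach, this parses the input as XML rather than HTML, so it expects well-formed markup.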