How do I strip all child tags from an XML element but merge their text into the parent, using lxml in Python?

Question

How do I tell etree.strip_tags() to strip all possible tags from a given element?

Do I have to map them out myself, for example:

STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
                           # that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)

Or is there a more elegant approach that I'm not aware of?

Example input:

parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"

Desired output:

# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

Or even better:

This is some text with multiple tags and sometimes they are nested.

Answer 1

ars*_*ars 5

You can use the lxml.html.clean module:

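import lxml.etree
import lxml.html
# Note: since lxml 5.2 the cleaner lives in the separate lxml_html_clean
# package, which has to be installed for this import to work.
import lxml.html.clean

s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

# Parse as an HTML fragment; unknown tags such as <parent> are kept.
tree = lxml.html.fromstring(s)
# Allow only <parent>; every other tag is dropped while its text is kept.
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)

# encoding='unicode' makes tostring() return a str instead of bytes on Python 3.
print(lxml.etree.tostring(cleaned_tree, encoding='unicode'))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>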

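If you would rather stay with lxml.etree and not depend on the HTML cleaner, here is a minimal sketch of two alternatives. It assumes the input is well-formed XML, and it relies on etree.strip_tags() accepting the same "*" wildcard as Element.iter(), which is how I read the lxml documentation. The variable names (root, plain_text, flattened) are only illustrative and do not come from the question.

```python
from lxml import etree

parent_tag = (
    "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> "
    "and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
)

# Parse as XML (not HTML), so the input must be well-formed.
root = etree.fromstring(parent_tag)

# Option 1: collect only the text, in document order, ignoring all tags.
plain_text = "".join(root.itertext())
print(plain_text)
# This is some text with multiple tags and sometimes they are nested.

# Option 2: strip every descendant tag in place with the "*" wildcard.
# The element passed in is never removed, so <parent> itself survives,
# and the text and tail of each stripped child are merged into it.
etree.strip_tags(root, "*")
flattened = etree.tostring(root, encoding="unicode")
print(flattened)
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
```

Option 1 gives the "even better" output from the question directly. Option 2 is useful when you still need the parent element afterwards.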
