Stripping inline tags with Python's lxml

Question

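I have to deal with two types of inline tags in an XML document. The first type of tag encloses text that I want to keep. I can handle this case with lxml's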

etree.tostring(element, method="text", encoding='utf-8')

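The second type of tag encloses text that I don't want to keep. How can I get rid of these tags together with their text? I'd prefer not to use regular expressions if possible.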

Thanks

Answer 1

Mar*_*air 10

I think strip_tags and strip_elements are what you want, one for each case. For example, this script:

from lxml import etree

text = "<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>"

tree = etree.fromstring(text)

# encoding="unicode" makes tostring() return str rather than bytes on Python 3
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))

# Remove the <z> tags, but keep their contents:
etree.strip_tags(tree, 'z')

print('-' * 72)
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))

# Remove all the <y> elements including their contents;
# with_tail=False keeps the text that follows each removed element:
etree.strip_elements(tree, 'y', with_tail=False)

print('-' * 72)
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))

...produces the following output:

<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>

------------------------------------------------------------------------
<x>hello, keep me and <y>ignore me</y>, and here's some <y>more</y> text</x>

------------------------------------------------------------------------
<x>hello, keep me and , and here's some  text</x>
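
If the end goal is plain text, as in the question, the two steps can probably be folded into one small helper. This is only a sketch: `inline_text` and `drop_tags` are made-up names for illustration, not part of lxml. It assumes the unwanted elements are removed first with `strip_elements`, after which `tostring(..., method="text")` takes care of the remaining tags, so a separate `strip_tags` call is not needed.

```python
from lxml import etree


def inline_text(xml, drop_tags):
    """Return the text of `xml` without the elements named in `drop_tags`.

    Hypothetical helper for illustration; not an lxml function.
    """
    root = etree.fromstring(xml)
    # with_tail=False removes each element and its own text, but keeps
    # the text that follows the element's closing tag.
    etree.strip_elements(root, *drop_tags, with_tail=False)
    # method="text" concatenates the remaining text nodes, so tags such
    # as <z> disappear while their contents stay.
    return etree.tostring(root, method="text", encoding="unicode")


sample = "<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>"
result = inline_text(sample, ["y"])
print(result)
```

Note that leaving `with_tail` at its default of `True` would also drop the text after each removed element, which is usually not what you want for inline markup.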
