Python ElementTree - Extract text from element, stripping tags

Question

Tre*_*ing 8 python elementtree xml-parsing

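With ElementTree in Python, how can I extract all the text from a node, stripping any tags inside that element and keeping only the text?

For example, say I have the following: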

<tag>
  Some <a>example</a> text
</tag>


I'd like to get back `Some example text`. How do I go about doing this? So far, the approaches I've taken have had fairly disastrous outcomes.

Answer 1

Ben*_*ueg 19

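If you are running under Python 3.2+, you can use itertext.

itertext creates a text iterator that loops over this element and all of its subelements in document order and returns all inner text: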

import xml.etree.ElementTree as ET
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

# -> 'Some example text'

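The one-line input above has no surrounding whitespace, but the indented, multi-line snippet from the question does. As a minimal sketch (the helper name `normalized_text` is my own and not part of ElementTree), you can join the pieces from itertext and then collapse every run of whitespace into a single space:

```python
import xml.etree.ElementTree as ET

def normalized_text(elem):
    """Join all text under elem and collapse whitespace runs into single spaces."""
    # ''.join(...) rebuilds the raw text; str.split() with no arguments
    # splits on any whitespace run and drops leading/trailing whitespace.
    return ' '.join(''.join(elem.itertext()).split())

xml = '''<tag>
  Some <a>example</a> text
</tag>'''
root = ET.fromstring(xml)
result = normalized_text(root)
print(result)
```

Note that this also collapses whitespace that was significant in the source, so it only fits cases where the exact spacing does not matter.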

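If you are running a lower version of Python, you can reuse the implementation of itertext() by attaching it to the Element class, after which you can call it exactly as above: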

# original implementation of .itertext() for Python 2.7
def itertext(self):
    tag = self.tag
    if not isinstance(tag, basestring) and tag is not None:
        return
    if self.text:
        yield self.text
    for e in self:
        for s in e.itertext():
            yield s
        if e.tail:
            yield e.tail

# if necessary, monkey-patch the Element class
if 'itertext' not in ET.Element.__dict__:
    ET.Element.itertext = itertext

xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

# -> 'Some example text'


Thanks, I had been looking for this for a while! (2 upvotes)

Answer 2

aba*_*ert 5

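As the documentation says, if you want to read only the text, without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order.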
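To make the text/tail split concrete, here is a small sketch that lists where each fragment of the question's example is stored (the variable name `pieces` is just for illustration):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<tag>Some <a>example</a> text</tag>')

# Element.iter() walks the element and its descendants in document order.
# .text is the text before the first child; .tail is the text that follows
# the element's closing tag, up to the next tag.
pieces = [(el.tag, el.text, el.tail) for el in root.iter()]
for tag, text, tail in pieces:
    print(tag, repr(text), repr(tail))
```

The trailing " text" belongs to the `<a>` element's tail rather than to `<tag>`, which is why reading only the text attributes loses it.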

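However, recent enough versions (including the ones in the stdlib in 2.7 and 3.2, but not 2.6 or 3.1, and the current release versions of both ElementTree and lxml on PyPI) can do this for you automatically with the tostring method: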

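>>> from xml.etree import ElementTree
>>> s = '''<tag>
...   Some <a>example</a> text
... </tag>'''
>>> t = ElementTree.fromstring(s)
>>> # pass the parsed element t, not the source string s
>>> # (on Python 3 the result is bytes unless encoding='unicode' is given)
>>> ElementTree.tostring(t, method='text')
'\n  Some example text\n'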

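On Python 3, tostring returns bytes by default. A sketch of the same call that returns a str instead, assuming Python 3.2+ where `encoding='unicode'` is accepted:

```python
import xml.etree.ElementTree as ET

s = '''<tag>
  Some <a>example</a> text
</tag>'''
root = ET.fromstring(s)

# method='text' drops all markup; encoding='unicode' asks for a str
# instead of an encoded bytes object.
text = ET.tostring(root, encoding='unicode', method='text')
print(repr(text))
print(repr(text.strip()))
```

The surrounding newlines and indentation are kept as-is, so call `.strip()` on the result if you do not want them.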

If you also want to strip whitespace from the text, you need to do that manually. In your simple case, that's easy:

>>> ElementTree.tostring(t, method='text').strip()
'Some example text'


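In more complex cases, however, where you want to strip whitespace inside intermediate tags, you may have to fall back on recursively processing the texts and tails. That isn't too hard; you just have to remember to handle the possibility that either attribute is None. For example, here's a skeleton you can hook your own code into: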

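def textify(t):
    # collect text fragments in document order; text and tail may be None
    s = []
    if t.text:
        s.append(t.text)
    for child in t:  # iterate directly; getchildren() was removed in Python 3.9
        s.append(textify(child))
    if t.tail:
        s.append(t.tail)
    return ''.join(s)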


This version only works when text and tail are guaranteed to be str or None. For trees that you build manually, that is not guaranteed to be true.
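As one possible way to fill in that skeleton for the whitespace case, the sketch below strips each fragment and joins the non-empty ones with a separator. The function name `stripped_text` and its `sep` parameter are hypothetical, and it leans on itertext (Python 2.7 / 3.2+) rather than hand-written recursion:

```python
import xml.etree.ElementTree as ET

def stripped_text(elem, sep=' '):
    """Strip whitespace around every text fragment, then join with sep."""
    # itertext() yields each text and tail under elem in document order.
    parts = (part.strip() for part in elem.itertext())
    # Skip fragments that were whitespace only.
    return sep.join(p for p in parts if p)

xml = '''<tag>
  Some <a>  example  </a> text
</tag>'''
result = stripped_text(ET.fromstring(xml))
print(result)

# Caveat: a separator is inserted between fragments even where the source
# had no whitespace, so 'foo<b>bar</b>' comes out as 'foo bar'.
glued = stripped_text(ET.fromstring('<p>foo<b>bar</b></p>'))
```

Whether joining on a space is right depends on your markup; pass `sep=''` if inline tags should never introduce a gap.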
