lxml etree: get all the text before an element

Question

Mil*_*ell 2 python xml lxml elementtree xml-parsing

How can I separate all the text that precedes an element in an etree from the text that follows it?

from lxml import etree

tree = etree.fromstring('''
    <a>
        find
        <b>
            the
        </b>
        text
        <dd></dd>
        <c>
            before
        </c>
        <dd></dd>
        and after
    </a>
''')

What do I want? In this example the <dd> tags act as delimiters, and for each of them

for el in tree.findall('.//dd'):

I want all the text before it and all the text after it:

[
    {
        el : <Element dd at 0xsomedistinctaddress>,
        before : 'find the text',
        after : 'before and after'
    },
    {
        el : <Element dd at 0xsomeotherdistinctaddress>,
        before : 'find the text before',
        after : 'and after'
    }
]

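My idea was to replace each <dd> tag in the tree with some kind of placeholder and then cut the string at that placeholder, but I need to keep the correspondence with the actual elements.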

Answer 1

ale*_*cxe 5

There may be a simpler way, but I would use the following XPath expressions:

preceding-sibling::*/text()|preceding::text()
following-sibling::*/text()|following::text()
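
The expressions above use the `preceding`/`following` axes, while the implementation further down uses the sibling-only variants. As a rough sketch of how the two differ, the snippet below runs both against a small hypothetical document (the `doc` string and the `clean` helper are made up for illustration, not taken from the question) in which some text is nested two levels deep. The sibling-only form reaches the direct text of sibling elements but not text nested deeper inside them.

```python
from lxml import etree

# Hypothetical sample (not from the question): "three" is nested two levels deep.
doc = etree.fromstring("<a>one <b>two <i>three</i></b> four<dd/> five <c>six</c></a>")
dd = doc.find("dd")

def clean(nodes):
    """Strip each text node and drop the whitespace-only ones."""
    return [s.strip() for s in nodes if s.strip()]

# Text siblings of <dd> plus the direct text children of its sibling elements.
siblings_only = clean(dd.xpath("preceding-sibling::*/text()|preceding-sibling::text()"))

# Every text node that comes before <dd> in document order, at any depth.
whole_axis = clean(dd.xpath("preceding::text()"))

print(siblings_only)
print(whole_axis)
```

For the flat sample tree in the question both forms select the same text nodes, so the difference only matters once the markup between delimiters is nested.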

Sample implementation (it definitely violates the DRY principle):

def get_text_before(element):
    for item in element.xpath("preceding-sibling::*/text()|preceding-sibling::text()"):
        item = item.strip()
        if item:
            yield item

def get_text_after(element):
    for item in element.xpath("following-sibling::*/text()|following-sibling::text()"):
        item = item.strip()
        if item:
            yield item

for el in tree.findall('.//dd'):
    before = " ".join(get_text_before(el))
    after = " ".join(get_text_after(el))

    # print() is a function in Python 3
    print({
        "el": el,
        "before": before,
        "after": after
    })

Prints:

{'el': <Element dd at 0x10af81488>, 'before': 'find the text', 'after': 'before and after'}
{'el': <Element dd at 0x10af81200>, 'before': 'find the text before', 'after': 'and after'}
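
One possible way to remove the duplication is a single helper that takes the axis name as a parameter. This is only a sketch: `text_around` is a hypothetical name, and it uses the broader `preceding::text()` / `following::text()` axes rather than the sibling-only expressions, which means it collects text from the whole document before or after the element, not just from within the same parent.

```python
from lxml import etree

def text_around(element, axis):
    """Join the non-blank text nodes found on the given XPath axis.

    `axis` is expected to be "preceding" or "following" (assumed helper,
    not part of lxml).
    """
    if axis not in ("preceding", "following"):
        raise ValueError("axis must be 'preceding' or 'following'")
    parts = (t.strip() for t in element.xpath(f"{axis}::text()"))
    return " ".join(p for p in parts if p)

# Sample tree from the question.
tree = etree.fromstring('''
    <a>
        find
        <b>
            the
        </b>
        text
        <dd></dd>
        <c>
            before
        </c>
        <dd></dd>
        and after
    </a>
''')

results = [
    {
        "el": el,
        "before": text_around(el, "preceding"),
        "after": text_around(el, "following"),
    }
    for el in tree.findall(".//dd")
]

for row in results:
    print(row)
```

Because each dictionary keeps a reference to the actual `<dd>` element, the correspondence the question asks for is preserved without inserting placeholders into the tree.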
