How to match a text node and then follow the parent node using XPath

Mat*_*Mat 14 html python xpath lxml

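I'm trying to parse some HTML with XPath. Following the simplified XML example below, I want to match the string 'Text 1' and then grab the contents of the related content node.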

<doc>
    <block>
        <title>Text 1</title>
        <content>Stuff I want</content>
    </block>

    <block>
        <title>Text 2</title>
        <content>Stuff I don't want</content>
    </block>
</doc>

My Python code throws an error:

>>> from lxml import etree
>>>
>>> tree = etree.XML("<doc><block><title>Text 1</title><content>Stuff I want</content></block><block><title>Text 2</title><content>Stuff I don't want</content></block></doc>")
>>>
>>> # get all titles
... tree.xpath('//title/text()')
['Text 1', 'Text 2']
>>>
>>> # match 'Text 1'
... tree.xpath('//title/text()="Text 1"')
True
>>>
>>> # Follow parent from selected nodes
... tree.xpath('//title/text()/../..//text()')
['Text 1', 'Stuff I want', 'Text 2', "Stuff I don't want"]
>>>
>>> # Follow parent from selected node
... tree.xpath('//title/text()="Text 1"/../..//text()')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1330, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:14542)
  File "xpath.pxi", line 287, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:90093)
  File "xpath.pxi", line 209, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:89446)
  File "xpath.pxi", line 194, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:89281)
lxml.etree.XPathEvalError: Invalid type

Is this possible in XPath? Do I need to express what I want to do differently?

Joh*_*iss 23

Is this what you want?

//title[text()='Text 1']/../content/text()
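For reference, here is a minimal sketch of that expression run through lxml against the sample document from the question. The failing attempt in the question compares first (`text()="Text 1"` evaluates to a boolean) and then tries to apply `/../..` to that boolean, which is what triggers `Invalid type`. Moving the comparison into a predicate keeps the result a node-set, so further steps can follow. The variable name `wanted` is just for illustration.

```python
from lxml import etree

# Same sample document as in the question.
tree = etree.XML(
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>Stuff I don't want</content></block>"
    "</doc>"
)

# The comparison sits inside the predicate, so the selection is still a
# node-set and the '..' and 'content' steps can be applied to it.
wanted = tree.xpath("//title[text()='Text 1']/../content/text()")
print(wanted)  # ['Stuff I want']
```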

  • You can also use `//block[title='Text 1']/content` to get the related content node (2 upvotes)
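A quick sketch of the variant from the comment, again assuming the sample document from the question. This form selects the `content` elements themselves rather than their text nodes, so the text is read from each element's `.text` attribute. The names `content_nodes` and `texts` are illustrative only.

```python
from lxml import etree

tree = etree.XML(
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>Stuff I don't want</content></block>"
    "</doc>"
)

# Select every <content> whose parent <block> has a <title> equal to 'Text 1'.
content_nodes = tree.xpath("//block[title='Text 1']/content")

# The result is a list of elements; pull the text out of each one.
texts = [node.text for node in content_nodes]
print(texts)  # ['Stuff I want']
```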

Dim*_*hev 16

Use:

string(/*/*/title[. = 'Text 1']/following-sibling::content)

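Compared with the currently accepted solution by Johannes Weiß, this has at least two improvements:

  1. It avoids the very expensive abbreviation "//" (which usually causes the whole XML document to be scanned). This should be done whenever the structure of the XML document is known in advance.

  2. There is no going back up to the parent (it avoids the location step "/..").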
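As a rough illustration of the expression above in lxml, assuming the sample document from the question: because the expression is wrapped in `string()`, `xpath()` hands back a single string instead of a list of nodes. The variable name `value` is illustrative.

```python
from lxml import etree

tree = etree.XML(
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>Stuff I don't want</content></block>"
    "</doc>"
)

# /*/*/title walks exactly two levels down from the document root
# (doc -> block -> title) instead of scanning every descendant, and
# following-sibling::content moves sideways rather than up via '..'.
value = tree.xpath("string(/*/*/title[. = 'Text 1']/following-sibling::content)")
print(value)  # Stuff I want
```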

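  • What does `/*/*/` do? I tried it on a fairly large document and it seems just as slow as `//`. (2 upvotes)
  • @dentarg: `/*/*` selects all elements that are children of the top element of the document. It is faster than `//someName`, which traverses the whole document and selects every element named `"someName"`. In this answer we could use a more efficient expression: `string(/*/*/title[. = 'Text 1'][1]/following-sibling::content)`. Given a well-optimizing XPath processor, the expression in the answer should not be any less efficient, because whenever the `string()` function is given a node-set argument, it produces the string value of only the first node of that node-set. (2 upvotes)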
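The `string()` behaviour described in the last comment can be checked with a small sketch. The document here is hypothetical (not the one from the question): it has two blocks whose title is 'Text 1', so the node-set contains two `content` nodes, yet `string()` returns only the first one in document order. The names `all_matches` and `first_only` are illustrative.

```python
from lxml import etree

# Hypothetical document with two blocks sharing the same title.
tree = etree.XML(
    "<doc>"
    "<block><title>Text 1</title><content>First</content></block>"
    "<block><title>Text 1</title><content>Second</content></block>"
    "</doc>"
)

path = "/*/*/title[. = 'Text 1']/following-sibling::content"

# Without string(): every matching node is returned.
all_matches = tree.xpath(path + "/text()")

# With string(): only the string value of the first node in the node-set.
first_only = tree.xpath("string(" + path + ")")

print(all_matches)  # ['First', 'Second']
print(first_only)   # First
```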