Extracting text between nodes via XPath

Question

I am trying to read a specific part of a web page via XPath. The page is not very well-formed, but I cannot change that...

<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>

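I want to extract the text of the individual items, i.e. the text between the header divs (e.g. 'Here is the text of the first item.'). So far I have used this XPath expression: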

//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]

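However, I cannot hard-code the name of the ending item, because the items appear in a different order on the pages I want to scrape (e.g. "First item" may be followed by "Third item").

Any help on how to adjust my XPath query would be greatly appreciated.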

Answer 1

Mic*_*ijk 1

For completeness, here is the final query, put together from the various suggestions throughout the thread:

//*[
    @class='textfield' and position() = 1
]
//text() [
    preceding::*[
        @class='header' and contains(text(),'First item')
    ]
][
    following::*[
        preceding::*[
            @class='header'
        ][1][
            contains(text(),'First item')
        ]
    ]
]
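
If the surrounding code is Python, the same "everything up to the next header" grouping can also be sketched without the preceding/following axes, which the standard library's ElementTree does not support. This is only a rough illustration, not part of the thread: it assumes the markup parses as well-formed XML (for the real page an HTML parser such as lxml.html would probably be needed), and the names SAMPLE and items_by_header are made up for the example.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question (assumed well-formed for this sketch).
SAMPLE = """<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>"""


def items_by_header(xml_text):
    """Map each header's text to the text that follows it, up to the next header."""
    root = ET.fromstring(xml_text)
    # find() returns only the first matching child, so the footer div is skipped.
    container = root.find("div[@class='textfield']")
    items = {}
    current = None  # title of the header we are currently under
    for child in container:
        if child.get("class") == "header":
            # A header starts a new item; no end marker has to be named.
            current = "".join(child.itertext()).strip()
            items[current] = []
        elif current is not None:
            # Text inside a non-header element such as <strong> or <span>.
            items[current].append("".join(child.itertext()))
        if current is not None and child.tail:
            # Text sitting directly after this element, before the next one.
            items[current].append(child.tail)
    # Collapse the whitespace left over from the markup's indentation.
    return {title: " ".join("".join(parts).split()) for title, parts in items.items()}


if __name__ == "__main__":
    for title, text in items_by_header(SAMPLE).items():
        print(f"{title}: {text}")
```

Since every header simply closes the previous item, the order of the items on the page does not matter, and looking up a single item is just items_by_header(page)["First item"].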
