Nokogiri and XPath: find all text between two tags

Question

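I'm not sure whether this is a syntax problem or a version difference, but I can't seem to figure it out. I want to take the data inside the (unclosed) td, from the h2 tag up to the h3 tag. Here is what the HTML looks like: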

<td valign="top" width="350">
    <br><h2>NameIWant</h2><br>
    <br>Town<br>

    PhoneNumber<br>
    <a href="mailto:emailIwant@nowhere.com" class="links">emailIwant@nowhere.com</a>
    <br>
    <a href="http://websiteIwant.com" class="links">websiteIwant.com</a>
    <br><br>    
    <br><img src="images/spacer.gif"/><br>

    <h3><b>I want to stop before this!</b></h3>
    Lorem Ipsum Yadda Yadda<br>
    <img src="images/spacer.gif" border="0" width="20" height="11" alt=""/><br>
    <td width="25">
        <img src="images/spacer.gif" border="0" width="20" height="8" alt=""/>
        <td valign="top" width="200"><img src="images/spacer.gif"/>
            <br>
            <br>

            <table cellspacing="0" cellpadding="0" border="0"/>205"&gt;<tr><td>
                <a href="http://dontneedthis.com">
                </a></td></tr><br>
            <table border="0" cellpadding="3" cellspacing="0" width="200">
            ...


The <td valign> does not close until the very bottom of the page, which I think may be why I'm having trouble.

My Ruby code looks like this:

require 'open-uri'
require 'nokogiri'

@doc = Nokogiri::XML(open("http://www.url.com"))

content = @doc.css('//td[valign="top"] [width="350"]')

name = content.xpath('//h2').text
puts name # Returns NameIwant

townNumberLinks = content.search('//following::h2')
puts content # Returns <h2> NameIWant </h2>


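As I understand it, the following axis should "select everything in the document after the closing tag of the current node." If I try using preceding, like: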

townNumberLinks = content.search('//preceding::h3')
# I get: <h3><b>I want to stop before this!</b></h3>


I hope I've made clear what I'm trying to do. Thanks!

Answer 1

hel*_*cha 5

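This is not trivial. In the context of the node you selected (the td), to get everything between two elements you need to take the intersection of these two sets:

Set A: all nodes preceding the first h3: //h3[1]/preceding::node()

Set B: all nodes following the first h2: //h2[1]/following::node()

To perform the intersection, you can use the Kaysian method (named after Michael Kay, who proposed it). The basic formula is: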

A[count(.|B) = count(B)]
Applying it to your sets as defined above, where A = //h3[1]/preceding::node() and B = //h2[1]/following::node(), we get:

//h3[1]/preceding::node()[ count( . | //h2[1]/following::node()) = count(//h2[1]/following::node()) ]
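This selects all element and text nodes, starting with the first <br> after the </h2> tag and ending with the whitespace text node after the last <br>, just before the next <h3> tag.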

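You can easily select only the text nodes between the h2 and the h3 by replacing node() with text() in the expression. This returns all text nodes between the two headings (including whitespace and newlines):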

//h3[1]/preceding::text()[ count( . | //h2[1]/following::text()) = count(//h2[1]/following::text()) ]
