How do I get the next node after the current node's parent with the Symfony Crawler?

tro*_*war 4 html php dom web-crawler symfony

Example HTML5 to parse:

<div id="orderDetails">
    <div> ... any number of blocks with unnecessary stuff ... </div>
    <div>Label for important info</div>
    <table> ... some other block type ... </table>
    <div>Some very important info here</div>
    <div> ... any number of blocks with unnecessary stuff ... </div>
</div>

My PHP code looks like this:

$crawler = new \Symfony\Component\DomCrawler\Crawler($html);
$label = $crawler->filter('#orderDetails div:contains("Label for important info")');
$info = $label->parent()->next('div');
assert('Some very important info here' === $info->text(), 'Important info must be grabbed from HTML');

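Unfortunately, the crawler has no parent() or next() methods. It does have parents(), but that gives me all parent nodes, that is, all the divs, and I cannot tell them apart.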

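So in this case I have two questions:

  1. How do I get the parent of the current node? Not all of them, but the "actual" one!
  2. How do I traverse the DOM horizontally with something like next/prev?

Thanks.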

tro*_*war 6

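The story

After digging into the source code, I found that nextAll() returns not "all" nodes but "one" node ($node = $this->getNode(0);).

That means that if I need "the node two after the current one", I have to write $node->nextAll()->nextAll()->nextAll().

WTF?! That is a super strange naming convention (0_0).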

The answer

  1. How do I get the parent of the current node? Not all of them, but the "actual" one!

// This is only one parent node
$parent = $node->parents();

  2. How do I traverse the DOM horizontally with something like next/prev?

// This is only one node: next after current
$next = $node->nextAll();
// This is only one node: previous before current
$prev = $node->previousAll();
// This is only one node: next after two from current
$nextAfterTwo = $node->nextAll()->nextAll()->nextAll();

Concrete code solution

So, since the required functionality does exist, a working solution to the problem looks like this:

use Symfony\Component\DomCrawler\Crawler;

/**
 * Returns the sibling node that comes after the current one and matches the selector
 *
 * @param Crawler $start    Node from which to start traversing
 * @param string  $selector CSS selector like in `Crawler::filter($selector)`
 *
 * @return Crawler Found node wrapped in a Crawler
 *
 * @throws \InvalidArgumentException When no node is found
 */
function getNextFiltered(Crawler $start, string $selector) : Crawler
{
    $count = $start->parents()->count();
    $next  = $start->nextAll();
    while ($count-- > 0) {
        $filtered = $next->filter($selector);
        if ($filtered->count()) return $filtered;
        $next = $next->nextAll();
    }

    throw new \InvalidArgumentException('No node found');
}

In my case:

$crawler = new Crawler($html);
$label   = $crawler->filter('#orderDetails div:contains("Label for important info")');
$info    = getNextFiltered($label, 'div');
assert('Some very important info here' === $info->text(), 'Important info must be grabbed from HTML');
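For comparison, the same "next matching sibling" lookup can be sketched with PHP's built-in DOM extension, which needs no Symfony packages to run. This is only an illustration under assumptions: the helper name nextSiblingElement() is hypothetical (it is not part of Symfony or PHP), and the sample HTML fills the elided blocks from the question with placeholder text.

```php
<?php
// Hypothetical helper (not a Symfony or PHP API): walks forward through the
// siblings of $start and returns the first element with the given tag name.
function nextSiblingElement(DOMNode $start, string $tagName): ?DOMElement
{
    for ($node = $start->nextSibling; $node !== null; $node = $node->nextSibling) {
        // Whitespace text nodes and comments are skipped; only elements can match.
        if ($node instanceof DOMElement && $node->tagName === $tagName) {
            return $node;
        }
    }

    return null; // no matching sibling after $start
}

// Placeholder markup modelled on the question's example.
$html = <<<'HTML'
<div id="orderDetails">
    <div>Unrelated block</div>
    <div>Label for important info</div>
    <table><tr><td>Some other block type</td></tr></table>
    <div>Some very important info here</div>
    <div>Another unrelated block</div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

// Locate the label div, then step sideways to the next div sibling.
$xpath = new DOMXPath($doc);
$label = $xpath
    ->query('//div[@id="orderDetails"]/div[contains(., "Label for important info")]')
    ->item(0);

$info = nextSiblingElement($label, 'div');
echo $info->textContent, PHP_EOL;
```

Because the loop follows nextSibling directly, it stops at the first matching element and returns null instead of throwing when nothing matches; whether that is preferable to the exception used above depends on the calling code.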