tro*_*war 4 html php dom web-crawler symfony
Example HTML5 to parse:
<div id="orderDetails">
<div> ... any number of blocks with unnecessary stuff ... </div>
<div>Label for important info</div>
<table> ... some other block type ... </table>
<div>Some very important info here</div>
<div> ... any number of blocks with unnecessary stuff ... </div>
</div>
My PHP code looks like this:
$crawler = new \Symfony\Component\DomCrawler\Crawler($html);
$label = $crawler->filter('#orderDetails div:contains("Label for important info")');
$info = $label->parent()->next('div');
assert('Some very important info here' === $info->text(), 'Important info must be grabbed from HTML');
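Unfortunately, the Crawler has no parent() or next() methods. It does have parents(), but that gives me all the ancestor nodes, that is, all the divs, which I cannot tell apart.
So in this case I have two questions:
- How do I get the parent of the current node? Not all of them, but the immediate one.
- How do I traverse the DOM horizontally, in the manner of next/prev?
Thanks.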
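After digging into the source code, I found that the nextAll() method returns not "all" nodes but "one" node ($node = $this->getNode(0);).
This means that if I need the node two positions after the current one, I have to write $node->nextAll()->nextAll()->nextAll().
What the heck?! That is a super weird naming convention (0_0).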

- How do I get the parent of the current node? Not all of them, but the "actual" one!

// This is only one parent node
$parent = $node->parents();

- How do I traverse the DOM horizontally, in the manner of next/prev?

// This is only one node: the next one after the current
$next = $node->nextAll();
// This is only one node: the previous one before the current
$prev = $node->previousAll();
// This is only one node: the next one after two from the current
$nextAfterTwo = $node->nextAll()->nextAll()->nextAll();

So, since the required implementations do exist, a functional solution to the problem looks like this:

use Symfony\Component\DomCrawler\Crawler;

/**
 * Returns the sibling node that comes after the current one and matches the selector
 *
 * @param Crawler $start    Node from which to start traversing
 * @param string  $selector CSS selector, as accepted by `Crawler::filter($selector)`
 *
 * @return Crawler Found node wrapped in a Crawler
 *
 * @throws \InvalidArgumentException When no node is found
 */
function getNextFiltered(Crawler $start, string $selector): Crawler
{
    // The number of ancestors serves as the upper bound on iterations
    $count = $start->parents()->count();
    $next = $start->nextAll();
    while ($count-- > 0) {
        $filtered = $next->filter($selector);
        if ($filtered->count()) {
            return $filtered;
        }
        $next = $next->nextAll();
    }

    throw new \InvalidArgumentException('No node found');
}

In my case:

$crawler = new Crawler($html);
$label = $crawler->filter('#orderDetails div:contains("Label for important info")');
$info = getNextFiltered($label, 'div');
assert('Some very important info here' === $info->text(), 'Important info must be grabbed from HTML');
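
For comparison, here is a minimal sketch of the same "next sibling matching a tag" walk, written against PHP's built-in DOM extension only, so it runs without Symfony installed. Crawler::getNode(0) hands back a native DOMNode of the same kind, so the helper could presumably be applied to a node taken out of a Crawler as well. The helper name nextSiblingElement and the filled-in sample markup are assumptions made for illustration; they are not part of DomCrawler or of the original post.

```php
<?php
// Hypothetical helper (not a Symfony API): returns the first following
// sibling element with the given tag name, or null if there is none.
function nextSiblingElement(DOMNode $start, string $tagName): ?DOMElement
{
    for ($node = $start->nextSibling; $node !== null; $node = $node->nextSibling) {
        // Skip text nodes, comments and elements with a different tag name.
        if ($node instanceof DOMElement && $node->tagName === $tagName) {
            return $node;
        }
    }

    return null;
}

// Sample markup modelled on the question; the elided parts are filled
// with placeholder text for the sake of a runnable example.
$html = <<<'HTML'
<div id="orderDetails">
  <div>unnecessary stuff</div>
  <div>Label for important info</div>
  <table><tr><td>some other block type</td></tr></table>
  <div>Some very important info here</div>
  <div>unnecessary stuff</div>
</div>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Locate the label div by its text, then move sideways and upwards.
$label = $xpath->query('//div[@id="orderDetails"]/div[contains(., "Label for important info")]')->item(0);
$info = nextSiblingElement($label, 'div');   // horizontal traversal
$parent = $label->parentNode;                // the immediate parent only
```

Here $info->textContent holds the text of the div that follows the table, and $parent is the #orderDetails element rather than the whole ancestor chain.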