Qua*_*mis 11 html php web-scraping
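What are the pros and cons of the following libraries?
Of the above, I have used QP, which failed to parse invalid HTML, and simpleDomParser, which does a good job but leaks some memory because of its object model. You can keep that under control by calling $object->clear(); unset($object); once you no longer need an object.
Are there any other scrapers? What has your experience with them been? I am going to make this a community wiki so that we can build up a list of libraries that are useful for scraping.
I ran some tests based on Byron's answer: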
<?php
// Time three HTML parsers on the same page: built-in DOM, phpQuery, Simple HTML DOM.
include("lib/simplehtmldom/simple_html_dom.php");
include("lib/phpQuery/phpQuery/phpQuery.php");

echo "<pre>";

$html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
$data = array('pq' => array(), 'dom' => array(), 'simple_dom' => array());

// Built-in DOM extension with XPath
$timer_start = microtime(true);
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences the warnings libxml raises on invalid markup
$x = new DOMXPath($dom);
foreach ($x->query("//a") as $node) {
    $data['dom'][] = $node->getAttribute("href");
}
foreach ($x->query("//img") as $node) {
    $data['dom'][] = $node->getAttribute("src");
}
foreach ($x->query("//input") as $node) {
    $data['dom'][] = $node->getAttribute("name");
}
$dom_time = microtime(true) - $timer_start;
echo "dom: \t\t $dom_time . Got ".count($data['dom'])." items \n";

// phpQuery: iterating a result yields plain DOMElement nodes,
// so attributes are read with getAttribute() rather than as properties
$timer_start = microtime(true);
$doc = phpQuery::newDocument($html);
foreach ($doc->find("a") as $node) {
    $data['pq'][] = $node->getAttribute("href");
}
foreach ($doc->find("img") as $node) {
    $data['pq'][] = $node->getAttribute("src");
}
foreach ($doc->find("input") as $node) {
    $data['pq'][] = $node->getAttribute("name");
}
$time = microtime(true) - $timer_start;
echo "PQ: \t\t $time . Got ".count($data['pq'])." items \n";

// Simple HTML DOM
$timer_start = microtime(true);
$simple_dom = new simple_html_dom();
$simple_dom->load($html);
foreach ($simple_dom->find("a") as $node) {
    $data['simple_dom'][] = $node->href;
}
foreach ($simple_dom->find("img") as $node) {
    $data['simple_dom'][] = $node->src;
}
foreach ($simple_dom->find("input") as $node) {
    $data['simple_dom'][] = $node->name;
}
$simple_dom_time = microtime(true) - $timer_start;
echo "simple_dom: \t $simple_dom_time . Got ".count($data['simple_dom'])." items \n";

echo "</pre>";
and got:
dom: 0.00359296798706 . Got 115 items
PQ: 0.010568857193 . Got 115 items
simple_dom: 0.0770139694214 . Got 115 items
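I used to use Simple HTML DOM exclusively, until some bright SO'ers showed me the light. Hallelujah.
Just use the built-in DOM functions. They are written in C and are part of the PHP core, and they are faster and more efficient than any third-party solution. With Firebug, getting an XPath query is very simple. This one change made my PHP-based scrapers run faster while saving my precious time.
My scrapers used to take about 60 megabytes to scrape 10 sites asynchronously with curl. That was even with the Simple HTML DOM memory fix you mentioned.
Now my PHP processes never go above 8 megabytes.
Highly recommended.
EDIT
Okay, I did some benchmarks. The built-in DOM is at least an order of magnitude faster (roughly 17 times in the run below).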
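To make the built-in approach concrete, here is a minimal sketch that runs without any third-party library. The sample markup and the helper name extract_attributes are made up for illustration, and libxml_use_internal_errors() stands in for the @ operator as one way to keep libxml's warnings about sloppy markup quiet.

```php
<?php
// Hypothetical sample page: the <p> is never closed and </body></html> are missing.
$html = '<html><body><p>Links: <a href="/questions">Questions</a> '
      . '<a href="/tags">Tags</a><img src="/logo.png">'
      . '<form><input name="q"></form>';

// Illustrative helper (not part of PHP): maps XPath queries to the attribute to read.
function extract_attributes(string $html, array $queries): array
{
    // Buffer libxml parse warnings instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $found = array();
    foreach ($queries as $query => $attribute) {
        foreach ($xpath->query($query) as $node) {
            $found[] = $node->getAttribute($attribute);
        }
    }
    return $found;
}

$found = extract_attributes($html, array(
    '//a'     => 'href',
    '//img'   => 'src',
    '//input' => 'name',
));
print_r($found);
```

The XPath strings are the same kind you can copy out of a browser inspector, so in a real scraper only the query map should need to change per site.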
Built in php DOM: 0.007061
Simple html DOM: 0.117781
<?php
// Time the built-in DOM extension against Simple HTML DOM on the same page.
include("../lib/simple_html_dom.php");

$html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
$data = array('dom' => array(), 'simple_dom' => array());

// Built-in DOM extension with XPath
$timer_start = microtime(true);
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences the warnings libxml raises on invalid markup
$x = new DOMXPath($dom);
foreach ($x->query("//a") as $node) {
    $data['dom'][] = $node->getAttribute("href");
}
foreach ($x->query("//img") as $node) {
    $data['dom'][] = $node->getAttribute("src");
}
foreach ($x->query("//input") as $node) {
    $data['dom'][] = $node->getAttribute("name");
}
$dom_time = microtime(true) - $timer_start;
echo "built in php DOM : $dom_time\n";

// Simple HTML DOM
$timer_start = microtime(true);
$simple_dom = new simple_html_dom();
$simple_dom->load($html);
foreach ($simple_dom->find("a") as $node) {
    $data['simple_dom'][] = $node->href;
}
foreach ($simple_dom->find("img") as $node) {
    $data['simple_dom'][] = $node->src;
}
foreach ($simple_dom->find("input") as $node) {
    $data['simple_dom'][] = $node->name;
}
$simple_dom_time = microtime(true) - $timer_start;
echo "simple html DOM : $simple_dom_time\n";
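A single timed run like the one above is noisy, so if you want to repeat the comparison it may help to time each parser several times and keep the middle sample. The following is only a sketch under assumptions: median_runtime is a made-up helper, the page is synthetic so that nothing is fetched over the network, and hrtime() needs PHP 7.3 or later.

```php
<?php
// Hypothetical helper: runs $task $runs times and returns the middle sample in seconds.
// With an odd $runs this is the median; with an even $runs it is the upper of the two middle samples.
function median_runtime(callable $task, int $runs = 5): float
{
    $samples = array();
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);                        // monotonic clock, nanoseconds
        $task();
        $samples[] = (hrtime(true) - $start) / 1e9;   // convert to seconds
    }
    sort($samples);
    return $samples[intdiv(count($samples), 2)];
}

// Synthetic page with 200 links, images and inputs (an assumption, not the page used above).
$page = '<html><body>'
      . str_repeat('<p><a href="/q">q</a><img src="/i.png"><input name="n"></p>', 200)
      . '</body></html>';

// The task being timed: parse with the built-in DOM and count the links.
$count_links = function () use ($page): int {
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($page);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($dom);
    return $xpath->query('//a')->length;
};

$links = $count_links();
$seconds = median_runtime($count_links, 5);
printf("built-in DOM: %d links, median %.6f s over 5 runs\n", $links, $seconds);
```

To compare libraries, pass a second closure that does the same work with the other parser and report both medians side by side.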