How do I "scrape" content from a page's source?

Joe*_*ani 6 php scrape

I have this code, which fetches a page's HTML source:

$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);

I want to scrape some content out of it. For example, suppose the page's source contains:

<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />

Is there a way to scrape this out of the source and store it in a variable, so that it looks like this:

technorati.com Connection failed
icerocket.com Connection failed
weblogs.com Done
Etc.

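The page is dynamic, which is why I'm having trouble. Could I search the source for each site? But then how would I get the result that follows it (Connection failed / Done)?
Many thanks for your help!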

Che*_*ron 15

I have used the Simple HTML DOM PHP library to scrape several sites. It is available here: http://simplehtmldom.sourceforge.net/

Then use code like this:

<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);

//remove additional spaces
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

foreach ($html->find('h2') as $heading) { // for each <h2> heading
        // take the first <a> inside a <span>; skip headings that have none
        $link = $heading->find('span a', 0);
        if ($link) {
                echo preg_replace($pat, $rep, $link->plaintext) . "\n";
        }
}
?>

This produces output similar to:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents
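For the markup in the question specifically, the status text is not wrapped in its own element, so a plain regular expression may be simpler than a DOM query. The following is only a minimal sketch, not part of the library approach above: it assumes every entry has the shape `<strong>site</strong><br />status<br />`, and the `$page` sample, `$pattern`, and `$results` names are made up for illustration. It also has to run on the raw HTML, before `htmlentities()` is applied, because that call turns `<` into `&lt;` and the pattern would no longer match.

```php
<?php
// Hypothetical sample input, shortened from the markup shown in the question.
// In real use this would be the raw result of file_get_contents().
$page = <<<HTML
<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br />
HTML;

// Capture the site name inside <strong>...</strong>, skip the <br /> after it,
// then capture the status text up to the next <br tag.
$pattern = '~<strong>([^<]+)</strong>\s*<br\s*/?>\s*([^<]+?)\s*<br~i';

$results = [];
if (preg_match_all($pattern, $page, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $results[$m[1]] = $m[2]; // site name => status
    }
}

foreach ($results as $site => $status) {
    echo $site . ' ' . $status . "\n";
}
```

With the sample above this prints one line per site: `technorati.com Connection failed`, `icerocket.com Connection failed`, `weblogs.com Done`. Keeping the pairs in an array keyed by site name makes it easy to look up a single site later, for example `$results['weblogs.com']`. If the real page's markup varies more than the sample, a DOM-based approach like the one above is likely to be more robust.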