我想使用curl 抓取此Google搜索结果页面的内容.我一直在尝试设置不同的用户代理,并设置其他选项,但我似乎无法获取该页面的内容,因为我经常被重定向或我得到"页面移动"错误.
我相信它与查询字符串在某处被编码的事实有关,但我真的不确定如何解决这个问题.
//$url is the same as the link above
$ch = curl_init();
$user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0'
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch,CURLOPT_CONNECTTIMEOUT,120);
curl_setopt ($ch,CURLOPT_TIMEOUT,120);
curl_setopt ($ch,CURLOPT_MAXREDIRS,10);
curl_setopt ($ch,CURLOPT_COOKIEFILE,"cookie.txt");
curl_setopt ($ch,CURLOPT_COOKIEJAR,"cookie.txt");
echo curl_exec ($ch);
Run Code Online (Sandbox Code Playgroud)
我需要做些什么来让我的PHP代码显示页面的确切内容,就像我在浏览器上看到的那样?我错过了什么?有人能指出我正确的方向吗?
我在SO上看过类似的问题,但没有一个答案可以帮助我.
编辑:
我尝试使用Selenium WebDriver打开链接,它提供与cURL相同的结果.我仍然认为这与查询字符串中有特殊字符的事实有关,这些特征字符在进程中的某处被搞砸了.
小智 61
这是如何:
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
function get_web_page( $url )
{
$user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
$options = array(
CURLOPT_CUSTOMREQUEST =>"GET", //set request type post or get
CURLOPT_POST =>false, //set to GET
CURLOPT_USERAGENT => $user_agent, //set user agent
CURLOPT_COOKIEFILE =>"cookie.txt", //set cookie file
CURLOPT_COOKIEJAR =>"cookie.txt", //set cookie jar
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$err = curl_errno( $ch );
$errmsg = curl_error( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
$header['errno'] = $err;
$header['errmsg'] = $errmsg;
$header['content'] = $content;
return $header;
}
Run Code Online (Sandbox Code Playgroud)
例
//Read a web page and check for errors:
$result = get_web_page( $url );
if ( $result['errno'] != 0 )
... error: bad url, timeout, redirect loop ...
if ( $result['http_code'] != 200 )
... error: no page, no permissions, no service ...
$page = $result['content'];
Run Code Online (Sandbox Code Playgroud)
小智 12
对于模拟最人类行为的现实方法,您可能希望在curl选项中添加引用.您可能还想在curl选项中添加follow_location.相信我,无论谁说谷歌的结果是不可能的,都是完全的蠢事,应该把他/她的电脑扔到墙上,希望永远不再回到互联网上.您可以使用PHP cURL或Python中的libCURL模拟使用您自己的浏览器执行"IRL"的所有操作.你只需要做更多的cURLS来获得buff.然后你会明白我的意思.:)
$url = "http://www.google.com/search?q=".$strSearch."&hl=en&start=0&sa=N";
$ch = curl_init();
curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com/1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
curl_setopt($ch, CURLOPT_URL, urlencode($url));
$response = curl_exec($ch);
curl_close($ch);
Run Code Online (Sandbox Code Playgroud)
尝试这个:
$url = "http://www.google.com/search?q=".$strSearch."&hl=en&start=0&sa=N";
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
curl_setopt($ch, CURLOPT_URL, urlencode($url));
$response = curl_exec($ch);
curl_close($ch);
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
138875 次 |
| 最近记录: |