How do I get a page's content using cURL?

Question

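I want to use cURL to scrape the content of this Google search results page. I have tried setting different user agents and various other options, but I can't seem to get the page's content: I either get redirected repeatedly or I get a "page moved" error.

I believe it has to do with the query string being encoded somewhere along the way, but I'm really not sure how to work around that.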

    // $url is the same as the link above
    $ch = curl_init();
    $user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
    curl_setopt($ch, CURLOPT_TIMEOUT, 120);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    echo curl_exec($ch);

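What do I need to do to make my PHP code show the exact content of the page as I see it in my browser? What am I missing? Can anyone point me in the right direction?

I've looked at similar questions on SO, but none of the answers helped me.

EDIT:

I tried opening the link with Selenium WebDriver, and it gives the same result as cURL. I still think this has to do with the special characters in the query string getting mangled somewhere in the process.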

Answer 1

小智 61

Here's how:

    /**
     * Get a web file (HTML, XHTML, XML, image, etc.) from a URL.  Return an
     * array containing the HTTP server response header fields and content.
     */
    function get_web_page( $url )
    {
        $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';

        $options = array(
            CURLOPT_CUSTOMREQUEST  => "GET",         // set request type: post or get
            CURLOPT_POST           => false,         // set to GET
            CURLOPT_USERAGENT      => $user_agent,   // set user agent
            CURLOPT_COOKIEFILE     => "cookie.txt",  // set cookie file
            CURLOPT_COOKIEJAR      => "cookie.txt",  // set cookie jar
            CURLOPT_RETURNTRANSFER => true,     // return web page
            CURLOPT_HEADER         => false,    // don't return headers
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_ENCODING       => "",       // handle all encodings
            CURLOPT_AUTOREFERER    => true,     // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
            CURLOPT_TIMEOUT        => 120,      // timeout on response
            CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        );

        $ch      = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        $header['errno']   = $err;
        $header['errmsg']  = $errmsg;
        $header['content'] = $content;
        return $header;
    }

Example

    // Read a web page and check for errors:
    $result = get_web_page( $url );

    if ( $result['errno'] != 0 ) {
        // ... error: bad url, timeout, redirect loop ...
    }

    if ( $result['http_code'] != 200 ) {
        // ... error: no page, no permissions, no service ...
    }

    $page = $result['content'];
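
To make the two checks above concrete, here is a minimal sketch of one way to turn the returned array into a readable status. `describe_fetch_result()` is a hypothetical helper that is not part of the answer, and the arrays passed to it are hand-built stand-ins for real `get_web_page()` results, so nothing here touches the network.

```php
<?php
/**
 * Hypothetical helper (not from the answer): summarise the array that
 * get_web_page() returns, using the same two checks as the example.
 */
function describe_fetch_result(array $result): string
{
    $errno = (int) ($result['errno'] ?? 0);
    $code  = (int) ($result['http_code'] ?? 0);

    // cURL-level failure: bad URL, timeout, redirect loop, ...
    if ($errno !== 0) {
        return 'curl error ' . $errno . ': ' . ($result['errmsg'] ?? '');
    }

    // The transfer itself worked, but the server did not answer 200.
    if ($code !== 200) {
        return 'unexpected HTTP status ' . $code;
    }

    return 'ok, ' . strlen((string) ($result['content'] ?? '')) . ' bytes';
}

// Hand-built arrays standing in for real get_web_page() results.
$timeout  = ['errno' => 28, 'errmsg' => 'Operation timed out', 'http_code' => 0, 'content' => false];
$notFound = ['errno' => 0, 'errmsg' => '', 'http_code' => 404, 'content' => ''];
$success  = ['errno' => 0, 'errmsg' => '', 'http_code' => 200, 'content' => '<html></html>'];

echo describe_fetch_result($timeout), "\n";
echo describe_fetch_result($notFound), "\n";
echo describe_fetch_result($success), "\n";
```

Checking `errno` before `http_code` matters because `http_code` is typically 0 when the transfer itself failed, so the HTTP check alone would hide the real cause.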

Answer 2

小智 12

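For a realistic approach that emulates human behavior as closely as possible, you may want to add a referer to your curl options. You may also want to add follow_location to your curl options. Trust me, whoever said that cURLing Google results is impossible is a complete dolt and should throw his/her computer against the wall in the hope of never returning to the internet again. Everything you can do "IRL" with your own browser can be emulated using PHP cURL or libCURL in Python. You just need to do more cURLs to get buff. Then you will see what I mean. :)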

    $url = "http://www.google.com/search?q=".$strSearch."&hl=en&start=0&sa=N";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com/1');
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_VERBOSE, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
    curl_setopt($ch, CURLOPT_URL, urlencode($url));
    $response = curl_exec($ch);
    curl_close($ch);
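
The answer recommends both a referer and follow_location, but the snippet only sets the referer. As a rough sketch, the two could be combined in one options array like this. `browser_like_options()` is a hypothetical helper name, the referer and user agent are just the placeholder values from the snippet, and the PHP curl extension is assumed to be loaded.

```php
<?php
/**
 * Hypothetical helper (not from the answer): cURL options for a
 * browser-like GET request. Both arguments are caller-chosen placeholders.
 */
function browser_like_options(string $referer, string $userAgent): array
{
    return [
        CURLOPT_REFERER        => $referer,   // send a Referer header
        CURLOPT_USERAGENT      => $userAgent, // send a User-Agent header
        CURLOPT_FOLLOWLOCATION => true,       // follow Location: redirects
        CURLOPT_MAXREDIRS      => 10,         // give up after 10 redirects
        CURLOPT_RETURNTRANSFER => true,       // return the body from curl_exec()
    ];
}

$options = browser_like_options('http://www.example.com/1', 'Mozilla/4.0 (compatible;)');

// Apply the options to a handle. No request is sent until curl_exec() is called.
$ch = curl_init('http://www.example.com/');
curl_setopt_array($ch, $options);
```

Keeping the options in an array makes it easy to reuse the same settings for several handles via `curl_setopt_array()`.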

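Haven't tried the actual code, but great post, haha. (2 upvotes)
With `urlencode()` wrapped around the whole `$url`, you end up escaping "://" and so on, which cURL doesn't like. To make it work, just use `urlencode($strSearch)` inside `$url` and remove `urlencode()` from the `CURLOPT_URL` line. (2 upvotes)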
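
A small sketch of the difference this comment describes, assuming a made-up search term. `build_search_url()` is a hypothetical helper name; the URL layout is the one from the answer.

```php
<?php
/**
 * Hypothetical helper: build the search URL from the answer,
 * encoding only the search term rather than the whole URL.
 */
function build_search_url(string $strSearch): string
{
    return 'http://www.google.com/search?q=' . urlencode($strSearch) . '&hl=en&start=0&sa=N';
}

// The search term is an arbitrary example containing a space and an ampersand.
$good = build_search_url('php curl & cookies');

// Encoding the whole URL also escapes "://", "/", "?" and "&",
// so the result is no longer a usable URL.
$bad = urlencode($good);

echo $good, "\n";
echo $bad, "\n";
```

The first line printed is still a URL with a scheme and host, while the second starts with `http%3A%2F%2F`, which is the form the answer's snippet hands to `CURLOPT_URL`.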

Answer 3

One*_*rew 5

Try this:

    // Encode only the search term; encoding the whole URL would escape "://".
    $url = "http://www.google.com/search?q=".urlencode($strSearch)."&hl=en&start=0&sa=N";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_VERBOSE, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    curl_close($ch);

Archived:	13 years, 4 months ago
Views:	138875
Last activity:	6 years, 7 months ago