multi curl can't handle more than 200 requests at once

Asked by san*_*jiv · tags: php, curl

Can you tell me what the limits are when sending requests with multi_curl? When I try to send more than 200 requests, it times out.

See the code below.

// Collect the URLs to fetch
foreach ($newUrlArry as $url) {
    $gatherUrl[] = $url['url'];
}

/* Slice the URL list into batches of 10 and fetch each batch with multi cURL */
$totalUrlRequest = count($gatherUrl);
if ($totalUrlRequest > 10) {
    $offset = 10;
    $index = 0;
    $matchedAnchors = array();
    $dom = new DOMDocument;
    $noOfBatches = ceil($totalUrlRequest / $offset);
    for ($sl = 0; $sl < $noOfBatches; $sl++) {
        $output = array_slice($gatherUrl, $index, $offset);
        $index += $offset;
        $responseAction = $this->multiRequestAction($output);
        $k = 0;
        foreach ($responseAction as $responseHtml) {
            @$dom->loadHTML($responseHtml);
            $documentLinks = $dom->getElementsByTagName("a");
            for ($i = 0; $i < $documentLinks->length; $i++) {
                $documentLink = $documentLinks->item($i);
                // $match is the href prefix to look for (defined elsewhere)
                if ($documentLink->hasAttribute('href')
                        && substr($documentLink->getAttribute('href'), 0, strlen($match)) == $match) {
                    $childFlag = false;
                    foreach ($documentLink->childNodes as $words) {
                        $name = trim($words->nodeName);
                        if ($name == 'em' || $name == 'b' || $name == 'span' || $name == 'p') {
                            if (!empty($words->nodeValue)) {
                                $matchedAnchors[$sl][$k]['anchor'] = trim($words->nodeValue);
                                $matchedAnchors[$sl][$k]['img']    = 0;
                                $matchedAnchors[$sl][$k]['rel']    = $documentLink->hasAttribute('rel') ? 'Y' : 'N';
                                $childFlag = true;
                                break;
                            }
                        } elseif ($name == 'img') {
                            $alt = $words->getAttribute('alt');
                            if (!empty($alt)) {
                                $matchedAnchors[$sl][$k]['anchor'] = trim($alt);
                                $matchedAnchors[$sl][$k]['img']    = 1;
                                $matchedAnchors[$sl][$k]['rel']    = $documentLink->hasAttribute('rel') ? 'Y' : 'N';
                                $childFlag = true;
                                break;
                            }
                        }
                    }
                    // Fall back to the anchor's own text when no usable child node was found
                    if (!$childFlag) {
                        $matchedAnchors[$sl][$k]['anchor'] = $documentLink->nodeValue;
                        $matchedAnchors[$sl][$k]['img']    = 0;
                        $matchedAnchors[$sl][$k]['rel']    = $documentLink->hasAttribute('rel') ? 'Y' : 'N';
                    }
                }
            }
            $k++;
        }
    }
}
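For reference, the multiRequestAction() helper that does the actual multi-cURL work isn't shown in the question. A minimal sketch of what such a curl_multi-based batch fetcher might look like (the structure and option values here are assumptions, not the asker's actual code, and it's written as a plain function rather than the class method the question uses):

<?php
// Hypothetical batch fetcher: takes an array of URLs, returns the response bodies.
function multiRequestAction(array $urls)
{
    $mh      = curl_multi_init();
    $handles = array();

    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // per-handle limits matter:
        curl_setopt($ch, CURLOPT_TIMEOUT, 25);        // one slow host shouldn't stall the whole batch
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // Drive all transfers until every handle has finished
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for socket activity instead of busy-looping
        }
    } while ($running && $status == CURLM_OK);

    // Collect the bodies and release the handles
    $responses = array();
    foreach ($handles as $i => $ch) {
        $responses[$i] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $responses;
}
?>

Without per-handle CURLOPT_TIMEOUT / CURLOPT_CONNECTTIMEOUT values, curl_multi will happily wait on a single unresponsive host, which is one way a 200-URL run can end in a timeout.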

Answered by Mik*_*roa

Both @Phliplip and @lunixbochs have mentioned the common cURL pitfalls (maximum execution time and being rejected by the target server).

When sending that many cURL requests to the same server, I try to "play nice" and voluntarily schedule sleeps so I don't bombard the host. For a low-end site, 1000+ requests can feel like a mini DDoS!

Here's the code that worked for me. I used it to scrape a client's product data from their old website, since the data was locked up in a proprietary database system with no export function.

<?php
header('Content-type: text/html; charset=utf-8', true);
set_time_limit(0); // no maximum execution time
$urls = array(
    'http://www.example.com/cgi-bin/product?id=500',
    'http://www.example.com/cgi-bin/product?id=501',
    'http://www.example.com/cgi-bin/product?id=502',
    'http://www.example.com/cgi-bin/product?id=503',
    'http://www.example.com/cgi-bin/product?id=504',
);
$i = 0;
foreach ($urls as $url) {
    echo $url."\n";
    $curl = curl_init($url);
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 25);
    $html = curl_exec($curl);
    $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');
    curl_close($curl);
    // now do something with the info returned by curl
    $i++;
    // voluntary sleeps: a long pause every 10th request, a short one otherwise
    if ($i % 10 == 0) {
        sleep(20);
    } else {
        sleep(2);
    }
}
?>

The main features are:

  • No maximum execution time
  • Voluntary sleeping
  • A fresh curl init & exec for each request

In my experience, the sleep() calls will keep a server from rejecting you. However, if by "different servers" you mean you're sending a small number of requests to a large number of different servers, for example:

$urls = array(
    'http://www.example-one.com/', 
    'http://www.example-two.com/', 
    'http://www.example-three.com/', 
    'http://www.example-four.com/', 
    'http://www.example-five.com/', 
    'http://www.example-six.com/'
);

and you are already using set_time_limit(0);, then something else is going on and an error may be causing your code to die. Try

ini_set('display_errors', 1);
error_reporting(E_ALL);
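PHP-level errors aside, it can also help to check cURL's own error state after each request, since transport failures (timeouts, DNS errors, refused connections) don't raise PHP errors on their own. A minimal sketch, reusing the variable names from the code above:

$html = curl_exec($curl);
if ($html === false) {
    // curl_exec() returns false on failure when CURLOPT_RETURNTRANSFER is set;
    // curl_errno()/curl_error() then describe the transport-level failure
    echo 'cURL error '.curl_errno($curl).': '.curl_error($curl)."\n";
}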

and tell us what error message you're getting.