use*_*123 8 php curl curl-multi
我正在开发一个应用程序,它从一组站点获取所有URL并以数组形式或JSON显示它.
我可以使用for循环,问题是我尝试10个URL时的执行时间,它给我一个错误说exceeds maximum execution time.
在搜索时我发现了这个 multi curl
我还发现了这个快速PHP CURL多个请求:使用CURL检索多个URL的内容.我试图添加我的代码,但没有工作,因为我不知道如何使用该功能.
希望你帮帮我.
谢谢.
这是我的示例代码.
<?php
$urls=array(
'http://site1.com/',
'http://site2.com/',
'http://site3.com/');
$mh = curl_multi_init();
foreach ($urls as $i => $url) {
$urlContent = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
// validate url
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo '<a href="'.$url.'">'.$url.'</a><br />';
}
}
$conn[$i]=curl_init($url);
$fp[$i]=fopen ($g, "w");
curl_setopt ($conn[$i], CURLOPT_FILE, $fp[$i]);
curl_setopt ($conn[$i], CURLOPT_HEADER ,0);
curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,60);
curl_multi_add_handle ($mh,$conn[$i]);
}
do {
$n=curl_multi_exec($mh,$active);
}
while ($active);
foreach ($urls as $i => $url) {
curl_multi_remove_handle($mh,$conn[$i]);
curl_close($conn[$i]);
fclose ($fp[$i]);
}
curl_multi_close($mh);
?>
Run Code Online (Sandbox Code Playgroud)
这是我放在一起的一个功能,可以正确利用该curl_multi_init()功能。它或多或少与您在 PHP.net 上找到的功能相同,但有一些细微的调整。我在这方面取得了巨大的成功。
function multi_thread_curl($urlArray, $optionArray, $nThreads) {
//Group your urls into groups/threads.
$curlArray = array_chunk($urlArray, $nThreads, $preserve_keys = true);
//Iterate through each batch of urls.
$ch = 'ch_';
foreach($curlArray as $threads) {
//Create your cURL resources.
foreach($threads as $thread=>$value) {
${$ch . $thread} = curl_init();
curl_setopt_array(${$ch . $thread}, $optionArray); //Set your main curl options.
curl_setopt(${$ch . $thread}, CURLOPT_URL, $value); //Set url.
}
//Create the multiple cURL handler.
$mh = curl_multi_init();
//Add the handles.
foreach($threads as $thread=>$value) {
curl_multi_add_handle($mh, ${$ch . $thread});
}
$active = null;
//execute the handles.
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
if (curl_multi_select($mh) != -1) {
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
//Get your data and close the handles.
foreach($threads as $thread=>$value) {
$results[$thread] = curl_multi_getcontent(${$ch . $thread});
curl_multi_remove_handle($mh, ${$ch . $thread});
}
//Close the multi handle exec.
curl_multi_close($mh);
}
return $results;
}
//Add whatever options here. The CURLOPT_URL is left out intentionally.
//It will be added in later from the url array.
$optionArray = array(
CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',//Pick your user agent.
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_TIMEOUT => 10
);
//Create an array of your urls.
$urlArray = array(
'http://site1.com/',
'http://site2.com/',
'http://site3.com/'
);
//Play around with this number and see what works best.
//This is how many urls it will try to do at one time.
$nThreads = 20;
//To use run the function.
$results = multi_thread_curl($urlArray, $optionArray, $nThreads);
Run Code Online (Sandbox Code Playgroud)
完成后,您将拥有一个包含网站列表中所有 html 的数组。正是在这一点上,我将遍历它们并拉出所有网址。
像这样:
foreach($results as $page){
$dom = new DOMDocument();
@$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
// validate url
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo '<a href="'.$url.'">'.$url.'</a><br />';
}
}
}
Run Code Online (Sandbox Code Playgroud)
还值得将增加脚本运行时间的能力牢记在心。
如果您使用托管服务,无论您将最大执行时间设置为多少,您都可能会被限制在两分钟的球场内。只是深思熟虑。
这是通过以下方式完成的:
ini_set('max_execution_time', 120);
你总是可以尝试更多的时间,但你永远不会知道,直到你计时。
希望能帮助到你。
| 归档时间: |
|
| 查看次数: |
1142 次 |
| 最近记录: |