PHP XPath code optimization

Bla*_*ake 0 php optimization xpath dom

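I'm writing a page scraper for a site that is a bit slow but has a lot of information I'd like to use for widget purposes (with their permission). Currently it takes roughly 4-5 minutes to fetch and parse all ~150 pages I'm scraping. It will run as a crontab'd job; it writes to a temp table while generating and copies to the "live" table on completion, so the transition is seamless from the client site. Can you see a way to speed up my code, if that's possible?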

//mysql connection stuff here
function dnl2array($domnodelist) {
    $return = array();
    $nb = $domnodelist->length;
    for ($i = 0; $i < $nb; ++$i) {
        $return['pt'][] = utf8_decode(trim($domnodelist->item($i)->nodeValue));
        $return['html'][] = utf8_decode(trim(get_inner_html($domnodelist->item($i))));
    }
    return $return;
}

function get_inner_html( $node ) { 
    $innerHTML= ''; 
    $children = $node->childNodes; 
    foreach ($children as $child) { 
        $innerHTML .= $child->ownerDocument->saveXML( $child ); 
    } 

    return $innerHTML; 
}

// NEW curl instead of file_get_contents()
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);

// Grab the data.
$html = curl_exec($c);

// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
    echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
         "<p>cURL error: " . curl_error($c) . "</p>";
}

// $html = file_get_contents($url);
$doc = new DOMDocument;

// Load the html into our object
$doc->loadHTML($html);

$xPath = new DOMXPath( $doc );

// scrape initial page that contains list of everything I want to scrape
$results = $xPath->query('//div[@id="food-plan-contents"]//td[@class="product-name"]');
$test['itams'] = dnl2array($results);

foreach($test['itams']['html'] as $get_url){
    $prepared_url[] = ""; // The url being scraped, modified slightly to gain access to more information -- not SO applicable data to see
}
$i = 0;
$arr = array();
foreach ($prepared_url as $url) {

    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 20);

    // Grab the data.
    $html = curl_exec($c);

    // Check if the HTML didn't load right, if it didn't - report an error
    if (!$html) {
        echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
             "<p>cURL error: " . curl_error($c) . "</p>";
    }

    // $html = file_get_contents($url);
    $doc = new DOMDocument;
    $doc->loadHTML($html);

    $xPath = new DOMXPath($doc);

    $results = $xPath->query('//h3[@class="product-name"]');
    $arr[$i]['name'] = dnl2array($results);

    $results = $xPath->query('//div[@class="product-specs"]');
    $arr[$i]['desc'] = dnl2array($results);

    $results = $xPath->query('//p[@class="product-image-zoom"]');
    $arr[$i]['img'] = dnl2array($results);

    $results = $xPath->query('//div[@class="groupedTable"]/table/tbody/tr//span[@class="price"]');
    $arr[$i]['price'] = dnl2array($results);
    $arr[$i]['url'] = $url;
    if ($i % 5 == 1) {
        lazy_loader($arr); // lazy loader adds data to sql database
        $arr = array(); // reset rather than unset() so count($arr) below stays valid; keeps memory footprint light (server is wimpy -- but free!)
    }

    $i++;
    usleep(50000); // Don't be a bandwidth pig
}

// Get any stragglers
if (count($arr) > 0) {
    lazy_loader($arr);
}

// and copy table now that script is finished (done even when there were no stragglers)
$time = time() + (23 * 60 * 60); // Time + 23 hours for "tomorrow's date"
$tab_name = "sr_data_items_" . date("m_d_y", $time);
mysql_query("CREATE TABLE IF NOT EXISTS `{$tab_name}` LIKE `sr_data_items_skel`");
mysql_query("INSERT INTO `{$tab_name}` SELECT * FROM `sr_data_items_skel`");
mysql_query("TRUNCATE TABLE  `sr_data_items_skel`");

ric*_*t1k 6

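It sounds like you're dealing mainly with slow server response times. Even at 2 seconds for each of those 150 pages, you're looking at 300 seconds = 5 minutes. The best way to speed this up is to use curl_multi_* to run multiple connections at the same time.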
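To make that arithmetic concrete, here is a rough back-of-the-envelope sketch. The function name `estimateSeconds` is made up for illustration, the 2-second figure is only the example value above, and the model assumes the remote server does not slow down when it receives requests in parallel, which may not hold for a slow site.

```php
<?php
// Hypothetical helper for a rough wall-clock estimate (not part of the original code).
// Assumption: every page takes the same time and parallel requests do not slow each other down.
function estimateSeconds($pages, $secondsPerPage, $concurrency)
{
    // Pages are fetched in "waves" of $concurrency requests each.
    return ceil($pages / $concurrency) * $secondsPerPage;
}

echo estimateSeconds(150, 2, 1), "\n";  // one request at a time
echo estimateSeconds(150, 2, 15), "\n"; // 15 requests at a time
```

Under those assumptions the sequential run comes to 300 seconds, while 15 simultaneous connections would bring it down to roughly 20 seconds, plus parsing and database time.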

So replace the start of the foreach loop (up to and including the if (!$html) check) with this:

reset($prepared_url); // set internal pointer to first element
$running = array(); // map from curl reference to url
$finished = false;

$mh = curl_multi_init();


$i = 0;
while(!$finished || !empty($running)){
    // add urls to $mh up to a maximum
    while (count($running) < 15 && !$finished)
    {
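        $url = current($prepared_url); // current(), not next(): calling next() first would skip the first URL
        if ($url === FALSE)
        {
            $finished = true;
            break;
        }
        next($prepared_url); // advance the pointer for the following iteration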

        $c = setupcurl($url);

        curl_multi_add_handle($mh, $c);

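        // CurlHandle objects (PHP 8+) cannot be array keys, so key by an integer id
        $running[is_object($c) ? spl_object_id($c) : (int) $c] = $url;
    }

    curl_multi_exec($mh, $active);
    $info = curl_multi_info_read($mh);
    if (false === $info) {
        curl_multi_select($mh, 0.1); // nothing to report right now; wait briefly instead of spinning
        continue;
    }

    $c = $info['handle'];
    $key = is_object($c) ? spl_object_id($c) : (int) $c;
    $url = $running[$key];
    unset($running[$key]);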

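    $result = $info['result'];
    if ($result != CURLE_OK)
    {
        echo "Curl Error: " . $result . " on URL: " . $url . "\n";
        curl_multi_remove_handle($mh, $c); // release the failed handle before moving on
        curl_close($c);
        continue;
    }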

    $html = curl_multi_getcontent($c);

    $download_time = curl_getinfo($c, CURLINFO_TOTAL_TIME);

    curl_multi_remove_handle($mh, $c);



    // Check if the HTML didn't load right, if it didn't - report an error
    if (!$html) {
        echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>\n" .
             "<p>cURL error: " . curl_error($c) . "</p>\n";
    }

    curl_close($c);

    <<rest of foreach loop here>>

This keeps 15 downloads running at the same time and processes each one as it finishes.
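The snippet above calls `setupcurl()` without defining it. A minimal sketch of what it might look like, assuming it simply wraps the cURL options from the question's loop; the `$userAgent` parameter and its default value are placeholders standing in for the question's own `getUserAgent()`:

```php
<?php
// Hypothetical helper: setupcurl() is not defined in the answer, so this is
// one possible shape for it, reusing the options from the question's code.
function setupcurl($url, $userAgent = 'Mozilla/5.0 (compatible; ExampleScraper/1.0)')
{
    $c = curl_init($url);
    curl_setopt_array($c, array(
        CURLOPT_HEADER         => false,
        CURLOPT_USERAGENT      => $userAgent, // placeholder; pass getUserAgent() here
        CURLOPT_FAILONERROR    => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_RETURNTRANSFER => true,       // needed for curl_multi_getcontent()
        CURLOPT_TIMEOUT        => 20,
    ));
    return $c; // the caller adds this handle to the multi handle
}
```

In the question's setup it would be called as `setupcurl($url, getUserAgent())`. Keeping the options in one place also removes the duplicated block of `curl_setopt()` calls for the initial listing page.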