Perl concurrent downloads with HTTP::Async and Net::Async::HTTP are too slow

ber*_*orr 8 concurrency perl asynchronous http download

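I'm trying to GET roughly 70 URLs in parallel with two scripts: the first one, below, uses HTTP::Async, and the second one, on pastebin, uses Net::Async::HTTP. The problem is that I get much the same timing from both: about 8 to 14 seconds for the whole URL list. That is unacceptably slow compared to curl + xargs started from the shell, which fetches everything in under 3 seconds with 10-20 "threads". For example, Devel::Timer in the first script shows that the maximum queue length never even reaches 6 ($queue->in_progress_count <= 5, and $queue->to_send_count is always 0). So it looks as if the foreach loop calling $queue->add runs too slowly, and I don't know why. I see much the same with Net::Async::HTTP (the second script, on pastebin), which is even slower than the first.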

So, does anybody know what I'm doing wrong? How can I get concurrent download speed comparable to curl + xargs started from the shell?

#!/usr/bin/perl -w
use utf8;
use strict;
use POSIX qw(ceil);
use XML::Simple;
use Data::Dumper;
use HTTP::Request;
use HTTP::Async;
use Time::HiRes qw(usleep time);
use Devel::Timer;

#settings
use constant passwd => 'ultramegahypapassword';
use constant agent => 'supa agent dev.alpha';
use constant timeout => 10;
use constant slots => 10;
use constant debug => 1;

my @qids;
my @xmlz;
my $queue = HTTP::Async->new(slots => slots,max_request_time => 10, timeout => timeout, poll_interval => 0.0001);
my %responses;
my @urlz = (
'http://testpodarki.afghanet/api/products/4577',
'http://testpodarki.afghanet/api/products/4653',
'http://testpodarki.afghanet/api/products/4652',
'http://testpodarki.afghanet/api/products/4571',
'http://testpodarki.afghanet/api/products/4572',
'http://testpodarki.afghanet/api/products/4666',
'http://testpodarki.afghanet/api/products/4576',
'http://testpodarki.afghanet/api/products/4574',
'http://testpodarki.afghanet/api/products/4651',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[3294]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[3294]',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/product_option_values/188',
'http://testpodarki.afghanet/api/product_option_values/191',
'http://testpodarki.afghanet/api/product_option_values/187',
'http://testpodarki.afghanet/api/product_option_values/190',
'http://testpodarki.afghanet/api/product_option_values/189',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4653]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4653]',
'http://testpodarki.afghanet/api/images/products/4577/12176',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4652]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4652]',
'http://testpodarki.afghanet/api/images/products/4653/12390',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/images/products/4652/12388',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/images/products/3294/8965',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/images/products/4571/12159',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4666]',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4666]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/images/products/4572/12168',
'http://testpodarki.afghanet/api/product_option_values/185',
'http://testpodarki.afghanet/api/product_option_values/182',
'http://testpodarki.afghanet/api/product_option_values/184',
'http://testpodarki.afghanet/api/product_option_values/183',
'http://testpodarki.afghanet/api/product_option_values/186',
'http://testpodarki.afghanet/api/images/products/4666/12413',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/images/products/4576/12174',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4651]',
'http://testpodarki.afghanet/api/images/products/4574/12171',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4651]',
'http://testpodarki.afghanet/api/images/products/4651/12387'
);

my $timer = Devel::Timer->new();


foreach my $el (@urlz) {
    my $request = HTTP::Request->new(GET => $el);
    $request->header(User_Agent => agent);
    $request->authorization_basic(passwd,''); 
    push @qids,$queue->add($request);
    $timer->mark("pushed [$el], to_send=".$queue->to_send_count().", to_return=".$queue->to_return_count().", in_progress=".$queue->in_progress_count());
}

$timer->mark('requestz pushed');

while ($queue->in_progress_count) {
    usleep(2000);
    $queue->poke();
}

$timer->mark('requestz complited');

process_responses();


$timer->mark('responzez processed');

foreach my $q (@xmlz) {
#    print ">>>>>>".Dumper($q)."<<<<<<<<\n";
}

$timer->report();
print "\n\n";

zdi*_*dim 6

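My best results with HTTP::Async are a bit over 4 seconds, and up to a bit over 5. As I understand it, that approach isn't a requirement, so here is a simple forking example that takes a little over 2 seconds, and at most 3.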

It uses Parallel::ForkManager to manage the processes and LWP::UserAgent for the downloads.

use warnings;
use strict;
use Path::Tiny;    
use LWP::UserAgent;
use Parallel::ForkManager;

my @urls = @{ get_urls('https://pastebin.com/raw/VyhMEB3w') };

my $pm = Parallel::ForkManager->new(60);  # max of 60 processes at a time
my $ua = LWP::UserAgent->new; 
print "Downloading ", scalar @urls, " files.\n";

my $dir = 'downloaded_files/';
mkdir $dir if not -d $dir;
my $cnt = 0;   
foreach my $link (@urls) 
{
    my $file = "$dir/file_" . ++$cnt . '.txt';

    $pm->start and next;                        # child process

    # add code needed for actual pages (authorization etc)            
    my $response = $ua->get($link);        
    if ($response->is_success) {
        path($file)->spew_utf8($response->decoded_content);
    }
    else { warn $response->status_line }

    $pm->finish;                                # child exit
}
$pm->wait_all_children;

sub get_urls {
    my $resp = LWP::UserAgent->new->get($_[0]);
    return [ grep /^http:/, split /\s*'?,?\s*\n\s*'?/, $resp->decoded_content ];
};

The files are written using Path::Tiny. Its path function builds an object, and the spew family of methods writes the file.
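A minimal sketch of that Path::Tiny usage on its own, with no network access. The temporary directory and the sample content are assumptions made for illustration; in the script above the content comes from $response->decoded_content.

```perl
use strict;
use warnings;
use Path::Tiny;

# Hypothetical stand-in for $response->decoded_content
my $content = "name: caf\x{e9}\n";

# A throwaway directory, removed automatically when $dir goes out of scope
my $dir  = Path::Tiny->tempdir;
my $file = $dir->child('file_1.txt');   # path object for <tempdir>/file_1.txt

$file->spew_utf8($content);             # create or overwrite, encoding as UTF-8
my $read_back = $file->slurp_utf8;      # read the whole file, decoding UTF-8

print "round trip ok\n" if $read_back eq $content;
```

spew_utf8 replaces the whole file in one call, which suits one-file-per-download output like this.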

For reference, downloading sequentially takes around 26 seconds.

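With the maximum number of processes set to 30 this takes over 4 seconds, and with 60 a little over 2 seconds, about the same as with (up to) 90. There are 70 URLs in this test.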
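The effect of the process cap can be explored without a network. The sketch below is an illustration under stated assumptions, not the benchmark used for the numbers above: each "download" is replaced by a fixed sleep standing in for network latency, and plain fork/wait from core Perl is used in place of Parallel::ForkManager so that it runs without extra modules. The function name timed_run and the job counts are hypothetical.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Run $jobs simulated downloads with at most $max_procs children at once
# and return the elapsed wall-clock time in seconds.
sub timed_run {
    my ($jobs, $max_procs, $latency) = @_;
    my $running = 0;
    my $t0      = time;
    for my $job (1 .. $jobs) {
        # At the cap: block until one child exits before starting another
        if ($running >= $max_procs) {
            wait();
            $running--;
        }
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {       # child
            sleep $latency;    # assumption: stands in for one network-bound GET
            exit 0;
        }
        $running++;            # parent: one more child in flight
    }
    1 while wait() != -1;      # reap all remaining children
    return time - $t0;
}

my $few  = timed_run(12, 3,  0.2);   # jobs run in waves of at most 3
my $many = timed_run(12, 12, 0.2);   # all jobs in flight at once
printf "cap 3: %.2fs, cap 12: %.2fs\n", $few, $many;
```

Because each job spends nearly all of its time waiting, wall time drops as the cap rises until every job is in flight at once. Raising the cap further cannot help, which matches the plateau between 60 and 90 processes for 70 URLs.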

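Tested on a 4-core laptop with a good network connection. (The CPU isn't that important here.) The tests were run repeatedly, at different times and on multiple days.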


Comparison with the approach from the question

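The best HTTP::Async results are about two times slower than the above. They were obtained with 30-40 "slots"; with higher numbers the time goes up, which puzzles me. The module uses select for multiplexing, via Net::HTTP::NB (the non-blocking version of Net::HTTP). While select "doesn't scale well," that concerns hundreds of sockets, and I'd expect to be able to use more than 40 on this network-bound problem. The simple forked approach can.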

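Also, select is considered a slow method for monitoring sockets, while forks don't even need it since each process has its own URL. (Might this be the source of the module's overhead when there are many connections?) The inherent overhead of fork is fixed and is dwarfed by network access. If we were after (many) hundreds of downloads the system might get strained by the number of processes, but select wouldn't fare well either.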
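To make the contrast concrete, here is a minimal sketch of select-style multiplexing using the core IO::Select module: a single process watches several handles and services whichever one is ready. Pipes fed by short-lived children stand in for HTTP sockets; that stand-in, the delays, and the variable names are assumptions for illustration, and this is not a description of what HTTP::Async does internally.

```perl
use strict;
use warnings;
use IO::Select;
use Time::HiRes qw(sleep);

my $sel = IO::Select->new;
my %label;                       # stringified read handle => job number

# Three hypothetical "connections": each child replies after a different delay
for my $job (1 .. 3) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {             # child plays the remote server
        close $reader;
        sleep 0.1 * (4 - $job);  # later jobs are given shorter delays
        print {$writer} "reply for job $job\n";
        close $writer;
        exit 0;
    }
    close $writer;               # parent keeps only the read end
    $sel->add($reader);
    $label{$reader} = $job;
}

my @order;                       # job numbers, in the order replies were read
while ($sel->count) {
    # Block until at least one watched handle is readable
    for my $fh ($sel->can_read) {
        my $line = <$fh>;
        if (defined $line) {
            push @order, $label{$fh};
        }
        else {                   # EOF: this "connection" is finished
            $sel->remove($fh);
            close $fh;
        }
    }
}
1 while wait() != -1;            # reap the children
print "replies read in order: @order\n";
```

All bookkeeping for every handle goes through the one loop around can_read, whereas in the forked version each process simply blocks on its own single request.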

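Finally, select-based methods download strictly one file at a time, and the effect can be seen by printing as requests are added: we can see the delay. The forked downloads proceed in parallel (in this case all 70 at the same time, without a problem). There will then be a network or disk bottleneck, but that is tiny compared with the gain.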

Update: I pushed this to double the number of sites and processes, saw no signs of OS/CPU strain, and retained the average speed.

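So I'd say that if you need to shave off every second, use forks. But if this isn't critical and HTTP::Async (or something like it) has other benefits for you, then be content with (just a little) longer downloads.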


The HTTP::Async code that performed well ended up being simply

foreach my $link ( @urls ) {  
    $async->add( HTTP::Request->new(GET => $link) );
}    
while ( my $response = $async->wait_for_next_response ) { 
    # write file (or process otherwise)
}

I also tried tweaking headers and timings. (This included dropping keep-alive, as suggested, with $request->header(Connection => 'close'), to no effect.)
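For completeness, a self-contained sketch of how that loop and the header tweak might fit together. The helper names build_request and fetch_all are hypothetical, the slot count of 40 is taken from the 30-40 range mentioned above, and the URLs are expected on the command line; this is an assumed arrangement rather than the exact script that was timed.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Async;

# Build one GET request that asks the server not to keep the connection alive
sub build_request {
    my ($url) = @_;
    my $request = HTTP::Request->new(GET => $url);
    $request->header(Connection => 'close');
    return $request;
}

# Queue every URL, then collect responses as they complete
sub fetch_all {
    my ($slots, @urls) = @_;
    my $async = HTTP::Async->new(slots => $slots);
    $async->add(build_request($_)) for @urls;
    my @responses;
    while (my $response = $async->wait_for_next_response) {
        push @responses, $response;
    }
    return @responses;
}

# URLs are taken from the command line; with none given, nothing is fetched
for my $response (fetch_all(40, @ARGV)) {
    print $response->status_line, "\n";
}
```

Keeping the request construction in its own sub makes it easy to add the authorization and User-Agent headers the question's script needs without touching the queueing loop.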