ber*_*orr 8 concurrency perl asynchronous http download
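I'm trying to fetch about 70 URLs in parallel with two scripts: the first one, below, uses HTTP::Async, and the second one, on pastebin, uses Net::Async::HTTP. The problem is that I get much the same timings with both: about 8 to 14 seconds for the whole URL list. That is unacceptably slow compared to curl + xargs started from the shell, which fetches everything in under 3 seconds with 10-20 "threads". For example, Devel::Timer in the first script shows that the maximum queue length stays below 6 ($queue->in_progress_count <= 5, and $queue->to_send_count is always 0). So it looks like the foreach loop calling $queue->add runs too slowly, and I don't know why. I see much the same with Net::Async::HTTP (the second script, on pastebin), which is even slower than the first.
So, does anyone know what I'm doing wrong? How can I get concurrent download speeds comparable to curl + xargs started from the shell?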
#!/usr/bin/perl -w
use utf8;
use strict;
use POSIX qw(ceil);
use XML::Simple;
use Data::Dumper;
use HTTP::Request;
use HTTP::Async;
use Time::HiRes qw(usleep time);
use Devel::Timer;
#settings
use constant passwd => 'ultramegahypapassword';
use constant agent => 'supa agent dev.alpha';
use constant timeout => 10;
use constant slots => 10;
use constant debug => 1;
my @qids;
my @xmlz;
my $queue = HTTP::Async->new(slots => slots,max_request_time => 10, timeout => timeout, poll_interval => 0.0001);
my %responses;
my @urlz = (
'http://testpodarki.afghanet/api/products/4577',
'http://testpodarki.afghanet/api/products/4653',
'http://testpodarki.afghanet/api/products/4652',
'http://testpodarki.afghanet/api/products/4571',
'http://testpodarki.afghanet/api/products/4572',
'http://testpodarki.afghanet/api/products/4666',
'http://testpodarki.afghanet/api/products/4576',
'http://testpodarki.afghanet/api/products/4574',
'http://testpodarki.afghanet/api/products/4651',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[3294]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[3294]',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/product_option_values/188',
'http://testpodarki.afghanet/api/product_option_values/191',
'http://testpodarki.afghanet/api/product_option_values/187',
'http://testpodarki.afghanet/api/product_option_values/190',
'http://testpodarki.afghanet/api/product_option_values/189',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4653]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4653]',
'http://testpodarki.afghanet/api/images/products/4577/12176',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4652]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4652]',
'http://testpodarki.afghanet/api/images/products/4653/12390',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/images/products/4652/12388',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/images/products/3294/8965',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/images/products/4571/12159',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4666]',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4666]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/images/products/4572/12168',
'http://testpodarki.afghanet/api/product_option_values/185',
'http://testpodarki.afghanet/api/product_option_values/182',
'http://testpodarki.afghanet/api/product_option_values/184',
'http://testpodarki.afghanet/api/product_option_values/183',
'http://testpodarki.afghanet/api/product_option_values/186',
'http://testpodarki.afghanet/api/images/products/4666/12413',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/images/products/4576/12174',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4651]',
'http://testpodarki.afghanet/api/images/products/4574/12171',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4651]',
'http://testpodarki.afghanet/api/images/products/4651/12387'
);
my $timer = Devel::Timer->new();
foreach my $el (@urlz) {
    my $request = HTTP::Request->new(GET => $el);
    $request->header(User_Agent => agent);
    $request->authorization_basic(passwd,'');
    push @qids,$queue->add($request);
    $timer->mark("pushed [$el], to_send=".$queue->to_send_count().", to_return=".$queue->to_return_count().", in_progress=".$queue->in_progress_count());
}
$timer->mark('requestz pushed');
while ($queue->in_progress_count) {
    usleep(2000);
    $queue->poke();
}
$timer->mark('requestz complited');
process_responses();
$timer->mark('responzez processed');
foreach my $q (@xmlz) {
    # print ">>>>>>".Dumper($q)."<<<<<<<<\n";
}
$timer->report();
print "\n\n";
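The best I get with HTTP::Async is well over 4 seconds, and up to over 5. As I understand it, that approach is not a requirement, so here is a simple forking example that takes a bit over 2 seconds, and at most 3.
It uses Parallel::ForkManager and LWP::UserAgent for the downloads.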
use warnings;
use strict;
use Path::Tiny;
use LWP::UserAgent;
use Parallel::ForkManager;
my @urls = @{ get_urls('https://pastebin.com/raw/VyhMEB3w') };
my $pm = Parallel::ForkManager->new(60); # max of 60 processes at a time
my $ua = LWP::UserAgent->new;
print "Downloading ", scalar @urls, " files.\n";
my $dir = 'downloaded_files';
mkdir $dir if not -d $dir;
my $cnt = 0;
foreach my $link (@urls)
{
    my $file = "$dir/file_" . ++$cnt . '.txt';
    $pm->start and next;    # parent: child forked, move on to the next link

    # child process from here on
    # add code needed for actual pages (authorization etc)
    my $response = $ua->get($link);
    if ($response->is_success) {
        path($file)->spew_utf8($response->decoded_content);
    }
    else { warn $response->status_line }
    $pm->finish;            # child exit
}
$pm->wait_all_children;
sub get_urls {
    my $resp = LWP::UserAgent->new->get($_[0]);
    die "Cannot fetch URL list: ", $resp->status_line unless $resp->is_success;
    return [ grep /^http:/, split /\s*'?,?\s*\n\s*'?/, $resp->decoded_content ];
}
The files are written using Path::Tiny. Its path function builds an object, and the spew routines write the file.
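As a minimal sketch of that write step in isolation (the content string and file name are made up for illustration, standing in for `$response->decoded_content` and the numbered files above), something like this should round-trip a UTF-8 string through Path::Tiny:

```perl
use strict;
use warnings;
use Path::Tiny;

# Hypothetical content standing in for $response->decoded_content.
my $content = "caf\x{e9} \x{2713}\n";

# A temporary directory, so the sketch leaves nothing behind.
my $dir  = Path::Tiny->tempdir;
my $file = $dir->child('file_1.txt');   # path object for <tempdir>/file_1.txt

$file->spew_utf8($content);             # encode as UTF-8 and write the file
my $read_back = $file->slurp_utf8;      # read the file and decode it again

print "round trip ok\n" if $read_back eq $content;
```

The `_utf8` variants handle the encoding layer, which is why the download code can pass decoded content straight through.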
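For reference, sequential downloads take around 26 seconds.
With the maximum number of processes set to 30 this takes over 4 seconds, and with 60 a bit over 2 seconds, which is about the same as with (up to) 90. There are 70 URLs in this test.
Tested on a 4-core laptop with a good network connection. (The CPU isn't that important here.) The tests were run repeatedly, at various times and on multiple days.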
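To see how the process limit affects wall-clock time without touching the network, here is a rough sketch that replaces each download with a short sleep. The helper name `timed_run`, the job count and the 0.2 second latency are all assumptions for illustration; they are not the measurements quoted above.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);
use Parallel::ForkManager;

# Run $n_jobs simulated downloads with at most $max_procs children at once
# and return the elapsed wall-clock time in seconds.
sub timed_run {
    my ($max_procs, $n_jobs, $latency) = @_;
    my $pm = Parallel::ForkManager->new($max_procs);
    my $t0 = time;
    for my $job (1 .. $n_jobs) {
        $pm->start and next;   # parent: go on to the next job
        sleep $latency;        # child: stand-in for one network request
        $pm->finish;           # child exits
    }
    $pm->wait_all_children;
    return time - $t0;
}

for my $max_procs (1, 4, 8) {
    printf "%2d processes: %.2f s\n", $max_procs, timed_run($max_procs, 8, 0.2);
}
```

With a fixed per-request latency the elapsed time should shrink roughly in proportion to the process limit until the limit reaches the number of jobs, which matches the shape of the timings reported here.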
Comparison with the approach from the question
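The best HTTP::Async results are about twice as slow as the above. They come with 30-40 "slots"; with higher numbers the time goes up, which puzzles me. The module uses select for multiplexing, via Net::HTTP::NB (a non-blocking version of Net::HTTP). While select "doesn't scale well," that pertains to hundreds of sockets, and I'd expect to be able to use more than 40 on this network-bound problem. The simple forked approach does.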
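For readers unfamiliar with select-style multiplexing, here is a toy sketch of the idea using the core IO::Select module. It is not how HTTP::Async is implemented internally; the pipes, the child processes and the delays are assumptions that stand in for sockets and servers answering at different speeds.

```perl
use strict;
use warnings;
use IO::Select;
use Time::HiRes qw(sleep);

# Each child "responds" on its pipe after a delay (hypothetical values).
my %delay_for = (a => 0.3, b => 0.1, c => 0.2);
my $select = IO::Select->new;
my (%name_of, @pids);

for my $name (sort keys %delay_for) {
    pipe(my $reader, my $writer) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {              # child: wait, write one line, exit
        close $reader;
        sleep $delay_for{$name};
        print {$writer} "response $name\n";
        close $writer;
        exit 0;
    }
    close $writer;                # parent keeps only the read end
    push @pids, $pid;
    $name_of{ fileno($reader) } = $name;
    $select->add($reader);
}

# A single process watches every handle and services whichever is ready.
my @order;
while ($select->count) {
    for my $fh ($select->can_read) {
        my $line = <$fh>;
        push @order, $name_of{ fileno($fh) } if defined $line;
        $select->remove($fh);
        close $fh;
    }
}
waitpid $_, 0 for @pids;
print "completed in order: @order\n";
```

The point of the sketch is that all waiting happens in one process and one loop, whereas in the forked version each process simply blocks on its own request.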
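Also, select is considered a slow method for monitoring sockets, while forks don't even need that, since each process has its own URL. (This may account for the module's overhead when there are many connections?) Fork's inherent overhead is fixed, and dwarfed by the network access. If we were after (many) hundreds of downloads the system might get strained by the processes, but select wouldn't fare well either.
Finally, select-based methods download strictly one file at a time, and the effect can be seen by printing as requests are added: we can see the delay. The forked downloads go in parallel (in this case all 70 at the same time, without a problem). There will then be a network or disk bottleneck, but it is tiny in comparison to the gain.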
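Update: I pushed this to double the number of sites and processes, saw no signs of OS/CPU strain, and kept the average speed.
So I'd say that if you need to shave off every second, use forks. But if this isn't critical and HTTP::Async (or the like) has other benefits, then be content with (just a little) longer downloads.
The HTTP::Async code that performs well ended up being simply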
foreach my $link ( @urls ) {
    $async->add( HTTP::Request->new(GET => $link) );
}
while ( my $response = $async->wait_for_next_response ) {
    # write file (or process otherwise)
}
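The snippet above leaves out how `$async` is created and how a response is matched to its URL. A slightly fuller sketch could look like the following. The helper name `fetch_all` and the slot count of 30 are assumptions (30 is taken from the 30-40 range mentioned earlier), and the loop uses `not_empty` rather than relying on the return value in list context.

```perl
use strict;
use warnings;
use HTTP::Async;
use HTTP::Request;

# Fetch all @$urls with at most $slots requests in flight and return
# a hash reference of URL => HTTP::Response.
sub fetch_all {
    my ($urls, $slots) = @_;
    my $async = HTTP::Async->new(slots => $slots);

    my %url_for;    # request id => URL
    for my $url (@$urls) {
        my $id = $async->add(HTTP::Request->new(GET => $url));
        $url_for{$id} = $url;
    }

    my %response_for;
    while ($async->not_empty) {
        # In list context the response comes back together with its id.
        my ($response, $id) = $async->wait_for_next_response;
        next unless $response;    # nothing arrived yet, keep waiting
        $response_for{ $url_for{$id} } = $response;
    }
    return \%response_for;
}

# Usage: perl fetch.pl URL [URL ...]
if (@ARGV) {
    my $responses = fetch_all(\@ARGV, 30);
    for my $url (sort keys %$responses) {
        print "$url: ", $responses->{$url}->status_line, "\n";
    }
}
```

Keeping the id-to-URL map avoids guessing which request a response belongs to when responses arrive out of order.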
I also tried tweaking the headers and the timings. (This included dropping keep-alive, as suggested, with $request->header(Connection => 'close'), to no effect.)
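For completeness, this is roughly what that header tweak looks like on a single request; the URL is a placeholder and not one from the test. As noted, it made no measurable difference in these tests.

```perl
use strict;
use warnings;
use HTTP::Request;

my $url     = 'http://example.com/';    # placeholder URL, not from the post
my $request = HTTP::Request->new(GET => $url);

# Ask the server to close the connection after this response (no keep-alive).
$request->header(Connection => 'close');

print $request->as_string;
```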