Multithreaded web scraper?

Question

Alp*_*lta 3 c# multithreading .net-4.0 scraper web

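I've been thinking about making my web scraper multithreaded. I don't mean ordinary threads (e.g. Thread scrape = new Thread(Function);) but something like a thread pool, where there can be a very large number of threads.

My scraper works by using a for loop to scrape pages.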

for (int i = (int)pagesMin.Value; i <= (int)pagesMax.Value; i++)

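So how could I multithread the function (which contains the loop) with something like a thread pool? I've never used a thread pool before, and the examples I've seen have been confusing or vague to me.

I have modified my loop to this: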

int min = (int)pagesMin.Value;
int max = (int)pagesMax.Value;
ParallelOptions pOptions = new ParallelOptions();
pOptions.MaxDegreeOfParallelism = Properties.Settings.Default.Threads;
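// Parallel.For treats its upper bound as exclusive, so pass max + 1 to
// cover the same pages as the inclusive for loop above.
Parallel.For(min, max + 1, pOptions, i =>
{
    // Scraping
});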

Would this work, or have I got it wrong?

Answer 1

Jim*_*hel 5

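The problem with using pool threads is that they spend most of their time waiting for a response from the Web site. And the problem with using Parallel.ForEach is that it limits your parallelism.

I got the best performance by using asynchronous Web requests. I used a Semaphore to limit the number of concurrent requests, and the callback function did the scraping.

The main thread creates the Semaphore, like this: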

Semaphore _requestsSemaphore = new Semaphore(20, 20);

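The 20 was derived by trial and error. It turns out that the limiting factor is DNS resolution, which takes about 50 ms on average. At least, it did in my environment. 20 concurrent requests was the absolute maximum; 15 is probably more reasonable.

The main thread essentially loops, like this: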

while (true)
{
    _requestsSemaphore.WaitOne();
    string urlToCrawl = DequeueUrl();  // however you do that
    var request = (HttpWebRequest)WebRequest.Create(urlToCrawl);
    // set request properties as appropriate
    // and then do an asynchronous request
    request.BeginGetResponse(ResponseCallback, request);
}

The ResponseCallback method, which is called on a pool thread, processes the response, disposes of it, and then releases the semaphore so that another request can be made.

void ResponseCallback(IAsyncResult ir)
{
    try
    {
        var request = (HttpWebRequest)ir.AsyncState;
        // you'll want exception handling here
        using (var response = (HttpWebResponse)request.EndGetResponse(ir))
        {
            // process the response here.
        }
    }
    finally
    {
        // release the semaphore so that another request can be made
        _requestsSemaphore.Release();
    }
}

As I said, the limiting factor is DNS resolution. It turns out that DNS resolution is done on the calling thread (the main thread in this case). See "Is this really asynchronous?" for more information.

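The semaphore approach is simple to implement and works quite well. It's possible to get more than 20 concurrent requests, but in my experience doing so takes quite a bit of effort. I had to do a lot of DNS caching and ... well, it was difficult.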
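Because the synchronous part of starting a request runs on whichever thread calls BeginGetResponse, one possible way to keep it off the main loop is to start each request from a pool thread. This is only a sketch and not something the answer describes: RequestStarter, StartRequest and onResponse are hypothetical names, and it assumes the caller has already taken a slot with WaitOne() and that the callback releases the slot once the response has been handled.

```csharp
using System;
using System.Net;
using System.Threading;

static class RequestStarter
{
    // Hypothetical helper: queues the start of the request on a pool thread so
    // that any synchronous setup inside BeginGetResponse (such as DNS
    // resolution) blocks that pool thread instead of the caller.
    public static void StartRequest(string url, Semaphore semaphore, AsyncCallback onResponse)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                request.BeginGetResponse(onResponse, request);
            }
            catch (Exception)
            {
                // The callback will never run for this URL, so release the
                // slot here; otherwise the semaphore would leak a slot.
                semaphore.Release();
            }
        });
    }
}
```

In the main loop above, the call would look something like RequestStarter.StartRequest(urlToCrawl, _requestsSemaphore, ResponseCallback). Whether this actually raises the ceiling beyond 20 requests depends on the environment; it only moves the blocking work, it does not make DNS faster.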

You can probably simplify the above by using Task and the new async stuff in C# 5.0 (.NET 4.5). I'm not familiar enough with those to say how, though.
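For what it's worth, a throttled version using async/await might look roughly like the following. This is a minimal sketch under stated assumptions rather than a tested crawler: it needs .NET 4.5 or later, AsyncScraper and FetchAllAsync are hypothetical names, and the download itself is passed in as a delegate so that the throttling logic stands on its own. SemaphoreSlim plays the role that Semaphore plays in the callback version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class AsyncScraper
{
    // Hypothetical helper: runs fetch for every URL, with at most
    // maxConcurrency fetches in flight at any moment. Results come back
    // in the same order as the input URLs.
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls, int maxConcurrency, Func<string, Task<string>> fetch)
    {
        using (var throttle = new SemaphoreSlim(maxConcurrency, maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                // Wait for a free slot without blocking a thread.
                await throttle.WaitAsync();
                try
                {
                    return await fetch(url);
                }
                finally
                {
                    // Free the slot whether the fetch succeeded or failed.
                    throttle.Release();
                }
            }).ToList(); // ToList starts every task; the semaphore does the limiting.

            return await Task.WhenAll(tasks);
        }
    }
}
```

With a System.Net.Http.HttpClient named client, a call could be written as await AsyncScraper.FetchAllAsync(urls, 20, url => client.GetStringAsync(url)). The 20 is the trial-and-error figure from above and would need tuning for another environment.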
