I need to download about 2 million files from the SEC website. Each file has a unique URL and averages 10 kB. This is my current implementation:
List<string> urls = new List<string>();
// ... initialize urls ...
WebBrowser browser = new WebBrowser();
foreach (string url in urls)
{
    browser.Navigate(url);
    while (browser.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
    StreamReader sr = new StreamReader(browser.DocumentStream);
    StreamWriter sw = new StreamWriter(url.Substring(url.LastIndexOf('/') + 1));
    sw.Write(sr.ReadToEnd());
    sr.Close();
    sw.Close();
}
The projected time is about 12 days... Is there a faster way?

Edit: by the way, local file handling takes only 7% of the time.

Edit: this is my final implementation:
void Main(void)
{
    ServicePointManager.DefaultConnectionLimit = 10000;
    List<string> urls = new List<string>();
    // ... initialize urls ...
    int retries = urls.AsParallel().WithDegreeOfParallelism(8).Sum(arg => downloadFile(arg));
}

public int downloadFile(string url)
{
    int retries = 0;
retry:
    try
    {
        HttpWebRequest webrequest = (HttpWebRequest)WebRequest.Create(url);
        webrequest.Timeout = 10000;
        webrequest.ReadWriteTimeout = 10000;
        webrequest.Proxy = null;
        webrequest.KeepAlive = false;
        using (HttpWebResponse webresponse = (HttpWebResponse)webrequest.GetResponse())
        using (Stream sr = webresponse.GetResponseStream())
        using (FileStream sw = File.Create(url.Substring(url.LastIndexOf('/') + 1)))
        {
            sr.CopyTo(sw);
        }
    }
    catch (Exception ee)
    {
        if (ee.Message != "The remote server returned an error: (404) Not Found." &&
            ee.Message != "The remote server returned an error: (403) Forbidden.")
        {
            if (ee.Message.StartsWith("The operation has timed out") ||
                ee.Message == "Unable to connect to the remote server" ||
                ee.Message.StartsWith("The request was aborted: ") ||
                ee.Message.StartsWith("Unable to read data from the transport connection: ") ||
                ee.Message == "The remote server returned an error: (408) Request Timeout.")
                retries++;
            else
                MessageBox.Show(ee.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
            goto retry;
        }
    }
    return retries;
}
Execute the downloads concurrently instead of sequentially, and set a reasonable MaxDegreeOfParallelism; otherwise you will attempt too many simultaneous requests, which will look like a DOS attack:
public static void Main(string[] args)
{
    var urls = new List<string>();
    Parallel.ForEach(
        urls,
        new ParallelOptions { MaxDegreeOfParallelism = 10 },
        DownloadFile);
}

public static void DownloadFile(string url)
{
    using (var sr = new StreamReader(WebRequest.Create(url)
                        .GetResponse().GetResponseStream()))
    using (var sw = new StreamWriter(url.Substring(url.LastIndexOf('/') + 1)))
    {
        sw.Write(sr.ReadToEnd());
    }
}
Download the files in multiple threads. The number of threads depends on your throughput. Also, take a look at the WebClient and HttpWebRequest classes. Simple sample:
var list = new[]
{
    "http://google.com",
    "http://yahoo.com",
    "http://stackoverflow.com"
};

Parallel.ForEach(list,
    s =>
    {
        using (var client = new WebClient())
        {
            Console.WriteLine($"starting to download {s}");
            string result = client.DownloadString(s);
            Console.WriteLine($"finished downloading {s}");
        }
    });
I would use several threads in parallel, with a WebClient. I recommend setting the max degree of parallelism to the number of threads you want, since an unspecified degree of parallelism does not work well for long-running tasks. I have used 50 parallel downloads in one of my projects without problems, but depending on the speed of an individual download, a much lower number might be enough.
If you download multiple files in parallel from the same server, you are by default limited to a small number (2 or 4) of parallel downloads. While the HTTP standard specifies such a low limit, many servers don't enforce it. Use ServicePointManager.DefaultConnectionLimit = 10000; to increase the limit.
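Putting the two suggestions together, a minimal sketch might look like the following. The URL list is a placeholder, the degree of parallelism (50) is just an illustrative value from the answer above, and saving under the last path segment of the URL assumes every URL ends in a usable file name:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;

class Downloader
{
    static void Main()
    {
        // Without this, .NET allows only a couple of connections per host,
        // which would serialize most of the parallel downloads.
        ServicePointManager.DefaultConnectionLimit = 10000;

        var urls = new List<string>(); // ... initialize urls ...

        Parallel.ForEach(
            urls,
            new ParallelOptions { MaxDegreeOfParallelism = 50 },
            url =>
            {
                using (var client = new WebClient())
                {
                    // Save under the file-name portion of the URL
                    // (assumes each URL ends in a distinct file name).
                    string fileName = url.Substring(url.LastIndexOf('/') + 1);
                    client.DownloadFile(url, fileName);
                }
            });
    }
}
```

Note that WebClient.DownloadFile streams directly to disk, so there is no need to buffer the whole response in memory as in the StreamReader/StreamWriter versions above.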