Dan*_*iel 1 c# html-parsing async-await html-agility-pack
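I am trying to build a web scraper that collects all the download links for CSS / JS / images from an HTML file.
Problem
The first breakpoint is hit, but after I press "Continue" the second breakpoint is never hit.
The code I am talking about: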
private static async void GetHtml(string url, string downloadDir)
{
//Get html data, create and load htmldocument
HttpClient httpClient = new HttpClient();
//This code gets executed
var html = await httpClient.GetStringAsync(url);
//This code is never reached
Console.ReadLine();
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
//Get all css download urls
var linkUrl = htmlDocument.DocumentNode.Descendants("link")
.Where(node => node.GetAttributeValue("type", "")
.Equals("text/css"))
.Select(node=>node.GetAttributeValue("href",""))
.ToList();
//Downloading css, js, images and source code
using (var client = new WebClient())
{
for (var i = 0; i <scriptUrl.Count; i++)
{
Uri uri = new Uri(scriptUrl[i]);
client.DownloadFile(uri,
downloadDir + @"\js\" + uri.Segments.Last());
}
}
}
Edit
I call the GetHtml method from here:
private static void Start()
{
//Create a list that will hold the names of all the subpages
List<string> subpagesList = new List<string>();
//Ask user for url and assign that to var url, also add the url to the url list
Console.WriteLine("Geef url van de website:");
string url = "https://www.hethwc.nl";
//Ask user for download directory and assign that to var downloadDir
Console.WriteLine("Geef locatie voor download:");
var downloadDir = @"C:\Users\Daniel\Google Drive\Almere\C# II\Download tests\hethwc\";
//Download and save the index file
var htmlSource = new System.Net.WebClient().DownloadString(url);
System.IO.File.WriteAllText(@"C:\Users\Daniel\Google Drive\Almere\C# II\Download tests\hethwc\index.html", htmlSource);
// Creating directories
string jsDirectory = System.IO.Path.Combine(downloadDir, "js");
string cssDirectory = System.IO.Path.Combine(downloadDir, "css");
string imagesDirectory = System.IO.Path.Combine(downloadDir, "images");
System.IO.Directory.CreateDirectory(jsDirectory);
System.IO.Directory.CreateDirectory(cssDirectory);
System.IO.Directory.CreateDirectory(imagesDirectory);
GetHtml("https://www.hethwc.nu", downloadDir);
}
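How are you calling GetHtml? Presumably from a synchronous Main method, with no other non-worker thread in play. Once your main thread exits, the process terminates. Something like: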
static void Main() {
GetHtml();
}
The code above terminates the process as soon as GetHtml returns and the Main method ends, which happens at the first incomplete await.
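To see the "returns at the first incomplete await" behaviour in isolation, here is a minimal sketch. The names AwaitDemo, GetHtmlAsync and ReachedAfterAwait are hypothetical, and Task.Delay is only a stand-in for httpClient.GetStringAsync. The blocking Wait() call is there just to keep this demo process alive long enough to observe the second state; it is not a recommendation.

```csharp
using System;
using System.Threading.Tasks;

public static class AwaitDemo
{
    // Hypothetical flag: set only by the code that runs after the await.
    public static volatile bool ReachedAfterAwait;

    // Hypothetical stand-in for GetHtml; Task.Delay plays the role of GetStringAsync.
    public static async Task GetHtmlAsync()
    {
        await Task.Delay(200);     // not complete yet, so control returns to the caller here
        ReachedAfterAwait = true;  // runs later, as a continuation, if the process is still alive
    }

    public static void Main()
    {
        Task pending = GetHtmlAsync();  // returns at the first incomplete await
        Console.WriteLine("Right after the call: " + ReachedAfterAwait);

        pending.Wait();                 // demo only: keep the process alive until the task finishes
        Console.WriteLine("After waiting: " + ReachedAfterAwait);
    }
}
```

If Main returned right after the first Console.WriteLine, the process would exit before the line after the await had a chance to run, which matches the breakpoint that is never hit.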
In current C# versions (C# 7.1 onwards) you can write an async Task Main() method, which lets you properly await your GetHtml method, as long as you change GetHtml to return Task:
async static Task Main() {
await GetHtml();
}
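Putting the two pieces together, a fuller sketch might look like the following. It assumes C# 7.1 or later; the class name Scraper, the method name GetHtmlAsync, the decision to return the HTML string, and the placeholder URL are all illustrative choices, not part of the original code. The HTML Agility Pack parsing and the file downloads are left out to keep the example self-contained.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class Scraper
{
    // One shared HttpClient for the lifetime of the program.
    private static readonly HttpClient Http = new HttpClient();

    // Returns Task<string> instead of void, so the caller can await completion.
    public static async Task<string> GetHtmlAsync(string url)
    {
        string html = await Http.GetStringAsync(url);
        // Code placed here now runs, because Main awaits the returned task
        // and so keeps the process alive until the download has finished.
        return html;
    }

    public static async Task Main(string[] args)
    {
        // Placeholder URL (an assumption); pass a real one as the first argument.
        string url = args.Length > 0 ? args[0] : "https://example.com";
        try
        {
            string html = await GetHtmlAsync(url);
            Console.WriteLine($"Downloaded {html.Length} characters from {url}");
        }
        catch (HttpRequestException ex)
        {
            // Because the task is awaited, a failed request surfaces here.
            Console.WriteLine($"Request failed: {ex.Message}");
        }
    }
}
```

A side benefit of returning Task rather than void is that exceptions thrown inside the method reach the awaiting caller; with async void the caller has no task to observe and cannot catch them.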