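What I want is to open a link found on a website (from its HtmlContent) and get the HTML of the newly opened page.
Example: I have www.google.com and want to find all of its links. For each link, I want the HTMLContent of the page it points to.
I am doing something like this: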
foreach (String link in GetLinksFromWebsite(htmlContent))
{
    using (var client = new WebClient())
    {
        htmlContent = client.DownloadString("http://" + link);
    }
    foreach (Match treffer in istBildURL)
    {
        string bildUrl = treffer.Groups[1].Value;
        bildLinks.Add(bildUrl);
    }
}

public static List<String> GetLinksFromWebsite(string htmlSource)
{
    string linkPattern = "<a href=\"(.*?)\">(.*?)</a>";
    MatchCollection linkMatches = Regex.Matches(htmlSource, linkPattern, RegexOptions.Singleline);
    List<string> linkContents = new List<string>();
    foreach (Match match in linkMatches)
    {
        linkContents.Add(match.Value);
    }
    return linkContents;
}
Another problem is that I only get plain links, not LinkButtons (ASP.NET). How can I solve this?
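Steps to follow:
Throw away everything in your project that uses regex or regular expressions to parse HTML (read this answer to better understand why). In your case, that is the body of the GetLinksFromWebsite method. Replace it with a call to an HTML parser such as Html Agility Pack, which the example here uses: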
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            var htmlSource = client.DownloadString("http://www.stackoverflow.com");
            foreach (var item in GetLinksFromWebsite(htmlSource))
            {
                // TODO: you could easily write a recursive function
                // that will call itself here and retrieve the respective contents
                // of the site ...
                Console.WriteLine(item);
            }
        }
    }

    public static List<String> GetLinksFromWebsite(string htmlSource)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);

        // SelectNodes returns null (not an empty collection) when nothing matches
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<String>();
        }

        return anchors
            .Select(node => node.Attributes["href"].Value)
            .ToList();
    }
}
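
The TODO in the loop above leaves open how to fetch each linked page. The following is a minimal sketch of one way to do that, not part of the original answer: the class LinkCrawler and its methods ResolveLinks and DownloadLinkedPages are hypothetical names introduced here. It assumes the href values come from GetLinksFromWebsite, resolves them against the address of the page they were found on (many hrefs are relative, so prepending "http://" is not enough), and goes only one level deep.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

static class LinkCrawler
{
    // Hypothetical helper: turns raw href values into absolute http/https URIs.
    // Hrefs with other schemes (javascript:, mailto:, ...) are dropped.
    public static List<Uri> ResolveLinks(Uri pageUri, IEnumerable<string> hrefs)
    {
        var result = new List<Uri>();
        foreach (var href in hrefs)
        {
            // Attribute values may still contain HTML entities such as &amp;
            var decoded = WebUtility.HtmlDecode(href).Trim();

            Uri absolute;
            if (Uri.TryCreate(pageUri, decoded, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp
                    || absolute.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(absolute);
            }
        }
        return result.Distinct().ToList();
    }

    // Hypothetical helper: downloads the HTML of every resolved link (one level only).
    public static Dictionary<Uri, string> DownloadLinkedPages(Uri pageUri, IEnumerable<string> hrefs)
    {
        var pages = new Dictionary<Uri, string>();
        using (var client = new WebClient())
        {
            foreach (var link in ResolveLinks(pageUri, hrefs))
            {
                try
                {
                    pages[link] = client.DownloadString(link);
                }
                catch (WebException)
                {
                    // Skip links that cannot be fetched and continue with the rest.
                }
            }
        }
        return pages;
    }
}
```

It could be called as LinkCrawler.DownloadLinkedPages(new Uri("http://www.stackoverflow.com"), GetLinksFromWebsite(htmlSource)). As for the LinkButton part of the question: an ASP.NET LinkButton is normally rendered as an anchor whose href is a javascript:__doPostBack(...) call, so the //a[@href] query does find it, but there is no URL behind it to download. The sketch simply filters such hrefs out; following them would mean reproducing the form postback, which is outside its scope.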