C#: Getting the HTML of linked pages (content) from a website

Question

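What I want is to open a link found on a website (in its HTML content) and get the HTML of the newly opened page.

Example: I have www.google.com, and now I want to find all of its links. For each link, I want the HTML content of the page it points to.

I'm doing something like this: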

foreach (String link in GetLinksFromWebsite(htmlContent))
{
    using (var client = new WebClient())
    {
        htmlContent = client.DownloadString("http://" + link);
    }

    foreach (Match treffer in istBildURL)
    {
        string bildUrl = treffer.Groups[1].Value;
        bildLinks.Add(bildUrl);
    }
}

public static List<String> GetLinksFromWebsite(string htmlSource)
{
    string linkPattern = "<a href=\"(.*?)\">(.*?)</a>";
    MatchCollection linkMatches = Regex.Matches(htmlSource, linkPattern, RegexOptions.Singleline);
    List<string> linkContents = new List<string>();
    foreach (Match match in linkMatches)
    {
        linkContents.Add(match.Value);
    }
    return linkContents;
}

Another problem is that I only get plain links, not LinkButtons (ASP.NET). How can I solve this?

Answer 1

Dar*_*rov 7

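Steps to follow:

1. Download Html Agility Pack.
2. Reference the assembly you have downloaded in your project.
3. Throw away everything in your project that starts with regex or regular expression and deals with parsing HTML (read this answer to better understand why). In your case, that would be the contents of the GetLinksFromWebsite method.
4. Replace what you have thrown away with a simple call to the Html Agility Pack parser.

Here is an example: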

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            var htmlSource = client.DownloadString("http://www.stackoverflow.com");
            foreach (var item in GetLinksFromWebsite(htmlSource))
            {
                // TODO: you could easily write a recursive function
                // that will call itself here and retrieve the respective contents
                // of the site ...
                Console.WriteLine(item);
            }
        }
    }

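    public static List<String> GetLinksFromWebsite(string htmlSource)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);

        // SelectNodes returns null (not an empty collection) when nothing matches
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<String>();
        }

        return anchors
            .Select(node => node.Attributes["href"].Value)
            .ToList();
    }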
}
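
The TODO in the example above could be filled in roughly as follows. This is only a sketch, not part of the original answer: `LinkCrawler`, `ResolveLinks`, `Crawl` and `Visit` are hypothetical names, and the downloader and the link extractor are passed in as delegates so the class itself needs nothing beyond the base class library. It resolves each `href` against the page it came from (hrefs are often relative, so prepending `"http://"` is not enough) and keeps only `http`/`https` targets. As far as I know, an ASP.NET LinkButton usually renders as an anchor whose `href` is a `javascript:__doPostBack(...)` call, so it cannot be fetched with a plain GET; the sketch simply skips such links.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original answer.
public static class LinkCrawler
{
    // Resolves raw href values against the page they were found on and keeps
    // only distinct absolute http/https URIs (javascript:, mailto:, empty or
    // malformed values are dropped).
    public static List<Uri> ResolveLinks(Uri pageUri, IEnumerable<string> hrefs)
    {
        var seen = new HashSet<string>();
        var result = new List<Uri>();
        foreach (var href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href)) continue;

            Uri absolute;
            if (!Uri.TryCreate(pageUri, href.Trim(), out absolute)) continue;
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps) continue;

            if (seen.Add(absolute.AbsoluteUri)) result.Add(absolute);
        }
        return result;
    }

    // Downloads the start page and follows links up to maxDepth levels.
    // maxDepth = 0 fetches only the start page; 1 also fetches the pages it links to.
    // Returns a map from absolute URL to the HTML downloaded for it.
    public static Dictionary<string, string> Crawl(
        Uri start,
        int maxDepth,
        Func<Uri, string> download,
        Func<string, IEnumerable<string>> extractHrefs)
    {
        var pages = new Dictionary<string, string>();
        Visit(start, maxDepth, download, extractHrefs, pages);
        return pages;
    }

    private static void Visit(
        Uri uri,
        int depth,
        Func<Uri, string> download,
        Func<string, IEnumerable<string>> extractHrefs,
        Dictionary<string, string> pages)
    {
        // Never fetch the same URL twice; this also stops link cycles.
        if (pages.ContainsKey(uri.AbsoluteUri)) return;

        string html;
        try
        {
            html = download(uri);
        }
        catch (Exception)
        {
            // Assumption: a page that fails to download is just skipped.
            return;
        }

        pages[uri.AbsoluteUri] = html;
        if (depth <= 0) return;

        foreach (var link in ResolveLinks(uri, extractHrefs(html)))
        {
            Visit(link, depth - 1, download, extractHrefs, pages);
        }
    }
}
```

Combined with the method above, a call might look like `LinkCrawler.Crawl(new Uri("http://www.stackoverflow.com"), 1, uri => { using (var c = new WebClient()) return c.DownloadString(uri); }, GetLinksFromWebsite)`. Keep the depth small: every extra level multiplies the number of requests.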

