How do I extract all links from an HTML document using HtmlAgilityPack?

Question

How do I extract all links from an HTML document using HtmlAgilityPack?

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s1);

foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    count++;
    HtmlAttribute att = link.Attributes["href"];
    if (att.Value.StartsWith("http") && !listBox1.Items.Contains(att.Value))
        listBox1.Items.Add(att.Value);
}

For example, I get 151 results, but there are actually more than 300. In many cases a single href that it finds contains several links, for example:

href="http://www.test.com dfsdfgfg https://www.test1.com 656567 http://test2.com

In that case I need to split it so that it is shown and counted as 3 links rather than one. I tried changing att.Value.StartsWith("http") to att.Value.Contains("http"), but that did not solve it.

Answer 1

Den*_*voy 5

You can do the following:

foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    count++;
    HtmlAttribute att = link.Attributes["href"];
    // An href may hold several space-separated URLs; check each one separately.
    foreach (var url in att.Value.Split(' '))
    {
        if (url.StartsWith("http") && !listBox1.Items.Contains(url))
            listBox1.Items.Add(url);
    }
}
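
The splitting step can also be tried on its own, without HtmlAgilityPack or a ListBox. The following is a minimal sketch, assuming the URLs inside one href value are separated by whitespace. `HrefSplitter` and `SplitUrls` are hypothetical names chosen for illustration, not part of any library. Unlike `Split(' ')`, it splits on any whitespace (tabs and line breaks included) and uses a `HashSet` to skip duplicates.

```csharp
using System;
using System.Collections.Generic;

public static class HrefSplitter
{
    // Hypothetical helper: returns each distinct token in an href value
    // that starts with "http", in order of first appearance.
    public static List<string> SplitUrls(string href)
    {
        var result = new List<string>();
        if (string.IsNullOrWhiteSpace(href)) return result;

        var seen = new HashSet<string>();
        // A null separator array makes Split treat all whitespace as delimiters.
        var tokens = href.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        foreach (var token in tokens)
        {
            // HashSet.Add returns false for a token that was already seen.
            if (token.StartsWith("http", StringComparison.OrdinalIgnoreCase) && seen.Add(token))
                result.Add(token);
        }
        return result;
    }
}
```

Calling `HrefSplitter.SplitUrls` on the sample href from the question returns three entries: http://www.test.com, https://www.test1.com and http://test2.com. The tokens "dfsdfgfg" and "656567" are dropped because they do not start with "http".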

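The HtmlAgilityPack loop only finds links inside the <a href="..."> tags of the HTML document. If you need to find all links (including those in JavaScript code, styles, and so on), you can use a regular expression, like this: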

// Requires: using System.Collections.Generic; using System.Text.RegularExpressions; using System.Web;
// The dot after "www" is written as [.] so that it matches a literal dot only.
private static readonly Regex cHttpUrlsRegex = new Regex(@"(?<url>((http|https):[/][/]|www[.])([a-z]|[A-Z]|[0-9]|[_/.=&?%-]|[~])*)", RegexOptions.IgnoreCase);

// Yields every URL found in aText; if aMatch is given, only URLs matching that pattern.
public static IEnumerable<string> ExtractHttpUrls(string aText, string aMatch = null)
{
    if (String.IsNullOrEmpty(aText)) yield break;
    var matches = cHttpUrlsRegex.Matches(aText);
    var vMatcher = aMatch == null ? null : new Regex(aMatch);
    foreach (Match match in matches)
    {
        var vUrl = HttpUtility.UrlDecode(match.Groups["url"].Value);
        if (vMatcher == null || vMatcher.IsMatch(vUrl))
            yield return vUrl;
    }
}

foreach (var link in ExtractHttpUrls(s1))
{
    count++;
    if (link.StartsWith("http") && !listBox1.Items.Contains(link))
        listBox1.Items.Add(link);
}
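
For a quick check of the regex approach outside a WinForms project, here is a minimal sketch. It uses a simplified pattern of my own, not the exact regex above, and the names `UrlScanner` and `FindUrls` are hypothetical. It skips URL decoding, so it needs only System.Text.RegularExpressions and no reference to System.Web.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlScanner
{
    // Simplified, assumed pattern: "http://" or "https://" followed by any run of
    // characters that are not whitespace, quotes, or angle brackets.
    private static readonly Regex UrlRegex =
        new Regex(@"https?://[^\s""'<>]+", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the distinct URLs in text, in order of first appearance.
    public static List<string> FindUrls(string text)
    {
        var result = new List<string>();
        if (string.IsNullOrEmpty(text)) return result;

        var seen = new HashSet<string>();
        foreach (Match match in UrlRegex.Matches(text))
        {
            // Keep only the first occurrence of each URL.
            if (seen.Add(match.Value))
                result.Add(match.Value);
        }
        return result;
    }
}
```

Because the pattern stops at whitespace and quotes, an href holding several space-separated URLs yields one match per URL, and URLs inside script or style text are picked up as well. A pattern this loose can also capture trailing punctuation, so treat the results as candidates and not as validated links.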
