Can someone explain this HtmlAgilityPack code?

sup*_*er9 4 c# web-scraping html-agility-pack

I've done my best to add comments throughout the code, but I'm stuck on some parts.

using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;

// create a new instance of the HtmlDocument class called doc
HtmlDocument doc = new HtmlDocument();

// the Load method is called here to load the variable result, which is HTML
// formatted into a string in a previous code snippet
doc.Load(new StringReader(result));

// a new variable called root with datatype HtmlNode is created here.
// I'm not sure what doc.DocumentNode refers to?
HtmlNode root = doc.DocumentNode;

// a list is getting constructed here. I haven't had much experience
// with constructing lists yet
List<string> anchorTags = new List<string>();

// a foreach loop is used to loop through the HTML document to
// extract HTML with 'a' attributes, I think...
foreach (HtmlNode link in root.SelectNodes("//a"))
{
    // don't really know what's going on here
    string att = link.OuterHtml;
    // don't really know what's going on here either
    anchorTags.Add(att);
}

I took this code sample from here. Thanks to Farooq Kaiser.

Lov*_*ode 5

The key is the SelectNodes method. This part uses XPath to fetch a list of the nodes in the HTML that match your query.

This is where I learned XPath: http://www.w3schools.com/xpath/default.asp

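Then it just loops through the matching nodes, gets the OuterHtml of each one (the full HTML, including the tags), and adds it to a list of strings. A List is basically just an array, but more flexible. If you only want the content and not the enclosing tags, you can use HtmlNode.InnerHtml or HtmlNode.InnerText.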
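To make the difference between the three properties concrete, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class `AnchorDemo`, the helper `DescribeAnchors`, and the sample HTML are invented for illustration. It also checks for null, since `SelectNodes` returns null by default rather than an empty collection when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class AnchorDemo
{
    // Hypothetical helper: builds one "outer | inner | text" line per <a> element.
    public static List<string> DescribeAnchors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse straight from a string

        var lines = new List<string>();
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
        if (links == null)
        {
            // No <a> elements matched the XPath query.
            return lines;
        }

        foreach (HtmlNode link in links)
        {
            // OuterHtml: the element with its own tags.
            // InnerHtml: only the markup between the tags.
            // InnerText: the text content with nested tags removed.
            lines.Add(link.OuterHtml + " | " + link.InnerHtml + " | " + link.InnerText);
        }
        return lines;
    }

    public static void Main()
    {
        string html = "<p><a href=\"page.html\"><b>Home</b> page</a></p>";
        foreach (string line in DescribeAnchors(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

The list grows as items are added, which is the "more flexible than an array" part: you do not need to know the number of links up front.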

  • +1. For those who find XPath hard to understand, you can use `Elements()`/`Descendants()` and then query everything with standard LINQ to XML `XElement`-style syntax. (2 upvotes)
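A rough sketch of the LINQ route mentioned in the comment above, again assuming the HtmlAgilityPack package is referenced. The class `LinqAnchorDemo`, the helper `CollectHrefs`, and the sample markup are hypothetical; `Descendants` and `GetAttributeValue` are the library calls being illustrated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinqAnchorDemo
{
    // Hypothetical helper: returns the href of every <a> element that has one.
    public static List<string> CollectHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")                              // every <a> below the root, no XPath needed
            .Select(a => a.GetAttributeValue("href", ""))  // "" when the attribute is missing
            .Where(href => href.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        string html = "<div><a href='first.html'>1</a><a>no href</a><a href='second.html'>2</a></div>";
        foreach (string href in CollectHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Unlike `SelectNodes`, `Descendants` yields an empty sequence when there are no matches, so no null check is needed here.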

Sim*_*ier 5

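In Html Agility Pack terms, "//a" means "find all tags named 'a' or 'A' anywhere in the document". For more general help on XPath, see the XPath documentation (it is independent of the Html Agility Pack). So, if your document looks like this: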

<div>
  <A href="xxx">anchor 1</a>
  <table ...>
    <a href="zzz">anchor 2</A>
  </table>
</div>

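You will get two anchor HTML elements. OuterHtml represents the HTML of the node, including the node itself, while InnerHtml represents only the HTML content of the node. So, here the two OuterHtml values are: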

  <A href="xxx">anchor 1</a>

<a href="zzz">anchor 2</A>

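Note that I specified 'a' or 'A' because the HAP implementation takes care of HTML case insensitivity. However, "//A" does not work by default: you need to write the tag names in your XPath expressions in lowercase.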
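One way to check this case behaviour yourself is a small sketch like the one below, assuming the HtmlAgilityPack package is referenced. The class `CaseDemo`, the helper `CountMatches`, and the sample markup are hypothetical. Going by the explanation above, the lowercase query should find both anchors and the uppercase query should find none with default settings.

```csharp
using System;
using HtmlAgilityPack;

public static class CaseDemo
{
    // Hypothetical helper: counts the nodes matched by an XPath query,
    // treating the null that SelectNodes returns for "no matches" as zero.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Mixed-case tags, as in the sample document above.
        string html = "<div><A href=\"xxx\">anchor 1</a><p><a href=\"zzz\">anchor 2</A></p></div>";

        Console.WriteLine("//a matches: " + CountMatches(html, "//a"));
        Console.WriteLine("//A matches: " + CountMatches(html, "//A"));
    }
}
```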