sup*_*er9 4 c# web-scraping html-agility-pack
I've done my best to comment the code below, but I'm stuck on a few parts.
using System.Collections.Generic; // List<T>
using System.IO;                  // StringReader
using HtmlAgilityPack;            // HtmlDocument, HtmlNode

// create a new instance of the HtmlDocument class called doc
HtmlDocument doc = new HtmlDocument();
// the Load method is called here to load the variable result, which is HTML
// formatted into a string in a previous code snippet
doc.Load(new StringReader(result));
// a new variable called root with datatype HtmlNode is created here.
// I'm not sure what doc.DocumentNode refers to?
HtmlNode root = doc.DocumentNode;

// a list is being constructed here. I haven't had much experience
// with constructing lists yet
List<string> anchorTags = new List<string>();

// a foreach loop is used to loop through the HTML document to
// extract HTML with 'a' attributes, I think...
foreach (HtmlNode link in root.SelectNodes("//a"))
{
    // don't really know what's going on here
    string att = link.OuterHtml;
    // don't really know what's going on here either
    anchorTags.Add(att);
}
I lifted this code sample from here. Thanks to Farooq Kaiser.
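The key is the SelectNodes method. That part uses XPath to get a list of the nodes in the HTML that match your query.
This is where I learned XPath: http://www.w3schools.com/xpath/default.asp
After that, the code simply loops over the matching nodes, reads each node's OuterHtml (the complete HTML, including the tag itself), and adds it to a list of strings. A List is basically an array, but more flexible: it grows as you add items. If you only want the content and not the enclosing tag, use HtmlNode.InnerHtml or HtmlNode.InnerText instead.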
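To make the difference between the three properties concrete, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class name AnchorDemo, the helper ExtractAnchors, and the sample HTML string are hypothetical and exist only for illustration. It also checks for null, because SelectNodes returns null rather than an empty collection when nothing matches, which the loop in the question does not guard against.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class AnchorDemo
{
    // Hypothetical helper: applies 'view' to every <a> element in the HTML
    // and collects the results in a List<string>.
    public static List<string> ExtractAnchors(string html, Func<HtmlNode, string> view)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse directly from a string

        var results = new List<string>();

        // SelectNodes returns null when no node matches the XPath query.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
        if (links == null)
        {
            return results;
        }

        foreach (HtmlNode link in links)
        {
            results.Add(view(link));
        }
        return results;
    }

    public static void Main()
    {
        // Made-up sample input, not taken from the question.
        string sampleHtml = "<p><a href=\"/home\"><b>Home</b> page</a></p>";

        // The element including its own <a ...> tag.
        Console.WriteLine(ExtractAnchors(sampleHtml, n => n.OuterHtml)[0]);
        // Only the markup between the opening and closing tags.
        Console.WriteLine(ExtractAnchors(sampleHtml, n => n.InnerHtml)[0]);
        // Only the text, with nested tags stripped.
        Console.WriteLine(ExtractAnchors(sampleHtml, n => n.InnerText)[0]);
    }
}
```

Passing the property as a lambda is just a convenience for this sketch; in your own code you would normally read link.OuterHtml, link.InnerHtml, or link.InnerText directly inside the loop.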
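In HTML Agility Pack terms, "//a" means "find all tags named 'a' or 'A' anywhere in the document". For more general help with XPath, see the XPath documentation (it is independent of the HTML Agility Pack). So, if your document looks like this: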
<div>
<A href="xxx">anchor 1</a>
<table ...>
<a href="zzz">anchor 2</A>
</table>
</div>
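You will get two anchor HTML elements. OuterHtml is the HTML of the node including the node itself, while InnerHtml is only the HTML content inside the node. So, the two OuterHtml values here are: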
<A href="xxx">anchor 1</a>
and
<a href="zzz">anchor 2</A>
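Note that I wrote 'a' or 'A' because HTML is case-insensitive and the HAP implementation takes care of that for you. However, "//A" will not work by default: you need to write the tag name in lowercase in the XPath expression.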
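The case behaviour described above can be checked with a short sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class XPathCaseDemo and the helper CountMatches are hypothetical names, and the HTML is a trimmed version of the sample document with the table removed.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathCaseDemo
{
    // Hypothetical helper: number of nodes matching 'xpath', or 0 when
    // SelectNodes returns null because nothing matched.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Mixed-case tags, as in the sample document.
        string html = "<div><A href=\"xxx\">anchor 1</a><a href=\"zzz\">anchor 2</A></div>";

        // Lowercase query: should find both anchors regardless of source casing.
        Console.WriteLine(CountMatches(html, "//a"));
        // Uppercase query: per the note above, finds nothing by default.
        Console.WriteLine(CountMatches(html, "//A"));
    }
}
```

If the second count surprises you in your own project, always writing element names in lowercase in XPath queries avoids the problem.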