How can I read an HTML file one paragraph at a time?

Question

B. *_*non 0 html c# text documentation-generation paragraph

I imagine it would be something like this (pseudocode):

var pars = new List<string>();
string par;
while (not eof("Platypus.html"))
{
    par = getNextParagraph();
    pars.Add(par);
}

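...where getNextParagraph() looks for the next "<p>" and continues until it finds "</p>", burning its bridges behind it ("cutting" the paragraph so that it is not found over and over again). Or something like that.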
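The scan-and-advance idea described above can be sketched with plain string searching. This is only a minimal sketch, not a real HTML parser: `ParagraphScanner` and `ReadParagraphs` are hypothetical names, and it assumes paragraphs are not nested, every `<p>` has a closing `</p>`, and no attribute value contains a `>` character.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphScanner
{
    // Hypothetical helper: lazily yields the inner HTML of each <p>...</p> block.
    public static IEnumerable<string> ReadParagraphs(string html)
    {
        int position = 0; // everything before this index has already been consumed
        while (true)
        {
            // Find the next opening tag, with or without attributes.
            int open = html.IndexOf("<p", position, StringComparison.OrdinalIgnoreCase);
            if (open < 0) yield break;

            // Skip tags that merely start with "<p", such as <pre> or <param>.
            int afterName = open + 2;
            if (afterName >= html.Length) yield break;
            char next = html[afterName];
            if (next != '>' && !char.IsWhiteSpace(next))
            {
                position = afterName;
                continue;
            }

            // The paragraph content starts right after the end of the opening tag.
            int contentStart = html.IndexOf('>', afterName);
            if (contentStart < 0) yield break;
            contentStart++;

            int close = html.IndexOf("</p>", contentStart, StringComparison.OrdinalIgnoreCase);
            if (close < 0) yield break;

            yield return html.Substring(contentStart, close - contentStart);

            // "Burn the bridge": resume searching after this paragraph's closing tag.
            position = close + 4;
        }
    }
}
```

To use it on the file from the question, the contents could be read first with `System.IO.File.ReadAllText("Platypus.html")` and passed to `ReadParagraphs`. For anything beyond simple, well-formed markup, a real parser is the safer choice.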

Does anybody know exactly how to do this, or a better way?

UPDATE

I tried to use Aurelien Souchet's code.

I have the following usings:

using HtmlAgilityPack;
using HtmlDocument = System.Windows.Forms.HtmlDocument;

...but this code:

HtmlDocument doc = new HtmlDocument();

is rejected ("Cannot access private constructor 'HtmlDocument' here").

Also, both "doc.LoadHtml()" and "doc.DocumentNode" give the old "Cannot resolve symbol 'Bla'" error message.

UPDATE 2

Okay, I had to prepend "HtmlAgilityPack." to the type name, which resolved the ambiguous reference.
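For illustration, the disambiguation might look like the sketch below. It assumes the HtmlAgilityPack package is referenced by the project; `DisambiguationExample` and `CountParagraphs` are hypothetical names used only for this example.

```csharp
using HtmlAgilityPack;
// The alias below is what causes the clash: it makes the bare name
// HtmlDocument refer to the Windows Forms type instead.
// using HtmlDocument = System.Windows.Forms.HtmlDocument;

public static class DisambiguationExample
{
    // Hypothetical helper: counts the <p> elements in an HTML string.
    public static int CountParagraphs(string html)
    {
        // Fully qualified name, so there is no doubt which HtmlDocument is meant.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//p");
        return nodes == null ? 0 : nodes.Count;
    }
}
```

If the Windows Forms type is not needed elsewhere in the file, simply dropping the alias should have the same effect.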

Answer 1

Aur*_*het 5

As people suggested in the comments, I think HtmlAgilityPack is the best choice: it is easy to use, and good examples and tutorials are easy to find.

Here is what I would write:

// Don't forget to add a reference to HtmlAgilityPack
using System.Collections.Generic;
using HtmlAgilityPack;

// Takes the HTML as a string and returns a list of strings
// holding the content of each paragraph.
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
   var pars = new List<string>();

   // First create an HtmlDocument
   HtmlDocument doc = new HtmlDocument();

   // Load the HTML (from a string)
   doc.LoadHtml(sourceHtml);

   // Select all the <p> nodes into an HtmlNodeCollection
   HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes(".//p");

   // SelectNodes returns null when nothing matches
   if (paragraphs == null)
   {
      return pars;
   }

   // Iterate over every node in the collection
   foreach (HtmlNode paragraph in paragraphs)
   {
      // Add the InnerText to the list
      // (or paragraph.InnerHtml, depending on what you want)
      pars.Add(paragraph.InnerText);
   }

   return pars;
}

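This is just a basic example. If your HTML contains nested paragraphs, this code may not work as expected; it all depends on the HTML you are parsing and what you want to do with it.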

Hope it helps!
