Har*_*oon 6 c# html-parsing .net-3.5 html-agility-pack
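I can parse the document and generate output, but because of one p tag the output cannot be parsed into an XElement. Everything else in the string is parsed correctly.
My input: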
var input = "<p> Not sure why is is null for some wierd reason!<br><br>I have implemented the auto save feature, but does it really work after 100s?<br></p> <p> <i>Autosave?? </i> </p> <p>we are talking...</p><p></p><hr><p><br class=\"GENTICS_ephemera\"></p>";
My code:
public static XElement CleanupHtml(string input)
{
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionOutputAsXml = true;
//htmlDoc.OptionWriteEmptyNodes = true;
//htmlDoc.OptionAutoCloseOnEnd = true;
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(input);
// ParseErrors is an ArrayList containing any errors from the Load statement
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
}
else
{
if (htmlDoc.DocumentNode != null)
{
var ndoc = new HtmlDocument(); // HTML doc instance
HtmlNode p = ndoc.CreateElement("body");
p.InnerHtml = htmlDoc.DocumentNode.InnerHtml;
var result = p.OuterHtml.Replace("<br>", "<br/>");
result = result.Replace("<br class=\"special_class\">", "<br/>");
result = result.Replace("<hr>", "<hr/>");
return XElement.Parse(result, LoadOptions.PreserveWhitespace);
}
}
return new XElement("body");
}
My output:
<body>
<p> Not sure why is is null for some wierd reason chappy!
<br/>
<br/>I have implemented the auto save feature, but does it really work after 100s?
<br/>
</p>
<p>
<i>Autosave?? </i>
</p>
<p>we are talking...</p>
**<p>**
<hr/>
<p>
<br/>
</p>
</body>
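The p tag marked in bold is the one that is not output correctly (it is never closed). Is there a way around this? Am I doing something wrong in my code?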
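What you are trying to do is essentially convert Html input into Xml output.
Html Agility Pack can do that when you set the OptionOutputAsXml option, but in that case you should not use the InnerHtml property. Instead, let Html Agility Pack do the work for you by using one of HtmlDocument's Save methods.
Here is a generic function that converts Html text into an XElement instance: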
public static XElement HtmlToXElement(string html)
{
if (html == null)
throw new ArgumentNullException("html");
HtmlDocument doc = new HtmlDocument();
doc.OptionOutputAsXml = true;
doc.LoadHtml(html);
using (StringWriter writer = new StringWriter())
{
doc.Save(writer);
using (StringReader reader = new StringReader(writer.ToString()))
{
return XElement.Load(reader);
}
}
}
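As you can see, you don't have to do much work yourself! Note that because your original input text has no root element, Html Agility Pack will automatically add an enclosing SPAN to ensure the output is valid Xml.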
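As a usage sketch of the function above (repeated here only so the snippet compiles on its own, and assuming the HtmlAgilityPack package is referenced): because of the wrapper element, it is probably safer to query with Descendants than to assume a particular root. The Demo class name and the sample string are illustrative assumptions, not part of the original code.

```csharp
using System;
using System.IO;
using System.Xml.Linq;
using HtmlAgilityPack;

public static class Demo
{
    // Same conversion as above, repeated so this sketch is self-contained.
    public static XElement HtmlToXElement(string html)
    {
        if (html == null)
            throw new ArgumentNullException("html");

        HtmlDocument doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;
        doc.LoadHtml(html);
        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            using (StringReader reader = new StringReader(writer.ToString()))
            {
                return XElement.Load(reader);
            }
        }
    }

    public static void Main()
    {
        // Hypothetical rootless fragment, similar in shape to the input in the question.
        XElement root = HtmlToXElement("<p>one<br>two</p><hr><p>three</p>");

        // The root is whatever wrapper Html Agility Pack emitted, so print it rather than assume it.
        Console.WriteLine("Root element: " + root.Name.LocalName);

        // Descendants finds the paragraphs wherever they sit under the wrapper.
        foreach (XElement p in root.Descendants("p"))
        {
            Console.WriteLine(p.Value);
        }
    }
}
```

If you need a body root as in the question, one option is to wrap the returned element yourself, for example with new XElement("body", root.Elements()).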
In your case you want to do some additional processing on certain tags, so here is how to handle your example:
public static XElement CleanupHtml(string input)
{
if (input == null)
throw new ArgumentNullException("input");
HtmlDocument doc = new HtmlDocument();
doc.OptionOutputAsXml = true;
doc.LoadHtml(input);
// extra processing, remove some attributes using DOM
HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//br[@class='special_class']");
if (coll != null)
{
foreach (HtmlNode node in coll)
{
node.Attributes.Remove("class");
}
}
using (StringWriter writer = new StringWriter())
{
doc.Save(writer);
using (StringReader reader = new StringReader(writer.ToString()))
{
return XElement.Load(reader);
}
}
}
As you can see, you should not use raw string functions. Use the Html Agility Pack DOM functions instead (SelectNodes, Add, Remove, and so on).
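To illustrate the same DOM-based approach on a different cleanup step, here is a minimal sketch that removes empty p elements with SelectNodes and RemoveChild rather than string replacement. Dropping empty paragraphs is a hypothetical requirement chosen for illustration, and the DomCleanupDemo class name and the XPath expression are assumptions, not something taken from the question.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class DomCleanupDemo
{
    public static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml("<p>we are talking...</p><p></p><hr><p><br class=\"special_class\"></p>");

        // Select p elements that have no child nodes at all (no text, no elements).
        // SelectNodes returns null when nothing matches, so guard before iterating.
        HtmlNodeCollection empties = doc.DocumentNode.SelectNodes("//p[not(node())]");
        if (empties != null)
        {
            // Copy to a list first so removing nodes cannot disturb the enumeration.
            foreach (HtmlNode node in empties.ToList())
            {
                node.ParentNode.RemoveChild(node);
            }
        }

        // Print the remaining markup to inspect the result.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Working on nodes this way does not depend on the exact spelling of the markup (attribute order, quoting, whitespace), which is where string Replace calls tend to break.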