BCS*_*BCS 12 html c# text-extraction d
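My question is somewhat like this question, but I have more constraints:
Is there any tool that can be set up to do this, or am I better off just breaking out RegexBuddy and C#?
I'm open to command-line or batch tools, as well as C/C#/D libraries.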
Sam*_*ron 19
I hacked together this code today using HTML Agility Pack. It extracts the unformatted, trimmed text.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static string ExtractText(string html)
{
    if (html == null)
    {
        throw new ArgumentNullException(nameof(html));
    }

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Collect every non-empty text node, trimmed.
    var chunks = new List<string>();
    foreach (var item in doc.DocumentNode.DescendantNodesAndSelf())
    {
        if (item.NodeType == HtmlNodeType.Text)
        {
            string text = item.InnerText.Trim();
            if (text != "")
            {
                chunks.Add(text);
            }
        }
    }
    return String.Join(" ", chunks);
}
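The method above keeps every text node, so the contents of script and style elements end up in the output, and entities such as &amp;amp; are left encoded. The following is a minimal sketch of one way to handle both, not part of the original answer. It reuses only calls that appear in this thread (DescendantNodesAndSelf, HtmlEntity.DeEntitize, ParentNode.Name); the names VisibleTextExtractor and ExtractVisibleText are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class; the name is not part of HTML Agility Pack.
public static class VisibleTextExtractor
{
    public static string ExtractVisibleText(string html)
    {
        if (html == null)
        {
            throw new ArgumentNullException(nameof(html));
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var chunks = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.DescendantNodesAndSelf())
        {
            if (node.NodeType != HtmlNodeType.Text)
            {
                continue;
            }

            // Skip text whose direct parent is <script> or <style>.
            string parentName = node.ParentNode?.Name;
            if (parentName == "script" || parentName == "style")
            {
                continue;
            }

            // Decode entities first, then trim and drop empty chunks.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                chunks.Add(text);
            }
        }
        return string.Join(" ", chunks);
    }
}
```

The parent check mirrors the one in the formatting-preserving sample further down, so the two approaches should treat script and style content the same way.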
If you want to keep some level of formatting, you can build on the sample provided with the source.
using System.IO;
using HtmlAgilityPack;

// Converts an HTML file on disk to plain text.
public string Convert(string path)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(path);
    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

// Converts an HTML string to plain text.
public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

public void ConvertTo(HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlNodeType.Comment:
            // don't output comments
            break;
        case HtmlNodeType.Document:
            ConvertContentTo(node, outText);
            break;
        case HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;
            // get text
            html = ((HtmlTextNode) node).Text;
            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;
            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html));
            }
            break;
        case HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }
            if (node.HasChildNodes)
            {
                ConvertContentTo(node, outText);
            }
            break;
    }
}

// Recurses into every child of the given node.
private void ConvertContentTo(HtmlNode node, TextWriter outText)
{
    foreach (HtmlNode subnode in node.ChildNodes)
    {
        ConvertTo(subnode, outText);
    }
}
Here is the code I'm using:
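using System.Text.RegularExpressions;
using System.Web;

public static string ExtractText(string html)
{
    // Replace every tag with a space, then decode HTML entities.
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    string s = reg.Replace(html, " ");
    s = HttpUtility.HtmlDecode(s);
    return s;
}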
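Note that the tag-stripping regex above removes only the tags themselves, so the text inside script and style elements stays in the result, and runs of whitespace are left as they are. Below is a minimal sketch of a variant that removes those blocks and comments first and collapses whitespace afterwards. It is an illustration rather than the answerer's code: the class name RegexTextExtractor is made up, and System.Net.WebUtility.HtmlDecode is swapped in for HttpUtility.HtmlDecode so that no System.Web reference is needed.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class, shown only to illustrate the idea.
public static class RegexTextExtractor
{
    // Whole <script>...</script> and <style>...</style> blocks, contents included.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // HTML comments, which may span several lines.
    private static readonly Regex Comment = new Regex(@"<!--.*?-->", RegexOptions.Singleline);

    // Any remaining tag.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // One or more whitespace characters.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ExtractText(string html)
    {
        string s = ScriptOrStyle.Replace(html, " ");
        s = Comment.Replace(s, " ");
        s = Tag.Replace(s, " ");
        // Decode entities only after the tags are gone, so that an
        // encoded "&lt;b&gt;" is kept as literal text.
        s = WebUtility.HtmlDecode(s);
        return Whitespace.Replace(s, " ").Trim();
    }
}
```

This is still a regex approach and remains fragile on unusual markup, for example a ">" inside an attribute value; a real parser such as HTML Agility Pack is likely the safer choice for untrusted input.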
You can use NUglify, which supports extracting text from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
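Because it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast (no regex involved, just a pure recursive descent parser).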