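I have snippets of HTML stored in a table. Not entire pages, no tags or the like, just basic formatting.
I would like to be able to display that HTML as text only, with no formatting, on a given page (actually just the first 30 to 50 characters, but that's the easy bit).
How do I place the "text" within that HTML into a string as straight text?
So this piece of code: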
<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>
becomes:
Hello World. Is there anyone out there?
Jud*_*ngo 90
The free and open-source HtmlAgilityPack has, in one of its samples, a method that converts from HTML to plain text.
var plainText = HtmlUtilities.ConvertToPlainText(string html);
Feed it an HTML string like
<b>hello world!</b><br /><i>it is me!!</i>
And you'll get a plain text result like:
hello world!
it is me!!
Ben*_*son 44
I couldn't use HtmlAgilityPack, so I wrote a second-best solution for myself:
// Requires: using System; using System.Text.RegularExpressions;
private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @">\s+<"; // matches one or more (white space or line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)"; // match any character between '<' and '>', even when end tag is missing
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>"; // matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);
    var text = html;
    // Remove tag whitespace/line breaks
    text = tagWhiteSpaceRegex.Replace(text, "><");
    // Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    // Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);
    // Decode HTML entities last, so that a decoded '<' or '>' is not mistaken for a tag
    text = System.Net.WebUtility.HtmlDecode(text);
    return text;
}
vfi*_*lby 23
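If you are talking about tag stripping, it is relatively straightforward as long as you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression: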
<[^>]*>
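If you do have to worry about things like <script> tags, then you'll need something a bit more powerful than regular expressions, because you need to track state, something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with "left to right" or non-greedy matching.
If you can use regular expressions, there are many web pages out there with good information on them.
If you need the more complex behaviour of a CFG, I would suggest using a third-party tool; unfortunately I don't know of a good one to recommend.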
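The non-greedy matching mentioned above can be sketched roughly as follows. This is only a minimal illustration, not a robust HTML parser: the class and method names (`TagStripper`, `Strip`) are made up for the example, and it assumes the markup has no `<` or `>` inside attribute values.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class TagStripper
{
    public static string Strip(string html)
    {
        // First drop <script> and <style> elements together with their contents.
        // ".*?" is non-greedy, so it stops at the first matching closing tag;
        // Singleline lets "." match across line breaks.
        string withoutCode = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Then strip every remaining tag with the simple pattern from above.
        return Regex.Replace(withoutCode, "<[^>]*>", string.Empty);
    }
}
```

Removing the script and style elements first matters because their contents would otherwise survive the plain tag strip and show up as text.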
Geo*_*ker 20
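HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN documentation:
If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters
< and > are encoded as &lt; and &gt; for HTTP transmission.
The HttpUtility.HtmlEncode() method, detailed here:
public static void HtmlEncode(
    string s,
    TextWriter output
)
Usage: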
String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
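For comparison, here is a small sketch of the same idea using System.Net.WebUtility, which offers static HtmlEncode and HtmlDecode methods and does not need a Server object. The wrapper class and method names are invented for the example. Note that encoding keeps the tags visible as literal text; it does not remove them.

```csharp
using System;
using System.Net;

// Hypothetical wrapper for illustration only.
public static class EncodingSketch
{
    // Replaces HTML-significant characters with character entities,
    // so a browser shows the markup literally instead of rendering it.
    public static string EncodeForDisplay(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    // Reverses the encoding above.
    public static string DecodeFromDisplay(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

With the test string from the usage above, `EncodingSketch.EncodeForDisplay("This is a <Test String>.")` returns `This is a &lt;Test String&gt;.`.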
WEF*_*EFX 10
To add to vfilby's answer, you can just perform a RegEx replace within your code; no new classes are necessary. In case other newbies like myself stumble upon this question.
using System.Text.RegularExpressions;
Then...
private string StripHtml(string source)
{
    string output;
    // get rid of HTML tags
    output = Regex.Replace(source, "<[^>]*>", string.Empty);
    // get rid of multiple blank lines
    output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);
    return output;
}
Gre*_*Gum 10
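Updated answer for 2023. The answer is basically the same as it always was:
Install the latest HtmlAgilityPack.
Create a utility class called HtmlUtilities that uses HtmlAgilityPack.
Use it: var plainText = HtmlUtilities.ConvertToPlainText(email.HtmlCode);
Here is the HtmlUtilities class, copied from the link above: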
using HtmlAgilityPack;
using System;
using System.IO;

namespace ReadSharp
{
    public class HtmlUtilities
    {
        /// <summary>
        /// Converts HTML to plain text / strips tags.
        /// </summary>
        /// <param name="html">The HTML.</param>
        /// <returns></returns>
        public static string ConvertToPlainText(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);
            StringWriter sw = new StringWriter();
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }

        /// <summary>
        /// Count the words.
        /// The content has to be converted to plain text before (using ConvertToPlainText).
        /// </summary>
        /// <param name="plainText">The plain text.</param>
        /// <returns></returns>
        public static int CountWords(string plainText)
        {
            return !String.IsNullOrEmpty(plainText) ? plainText.Split(' ', '\n').Length : 0;
        }

        public static string Cut(string text, int length)
        {
            if (!String.IsNullOrEmpty(text) && text.Length > length)
            {
                text = text.Substring(0, length - 4) + " ...";
            }
            return text;
        }

        private static void ConvertContentTo(HtmlNode node, TextWriter outText)
        {
            foreach (HtmlNode subnode in node.ChildNodes)
            {
                ConvertTo(subnode, outText);
            }
        }

        private static void ConvertTo(HtmlNode node, TextWriter outText)
        {
            string html;
            switch (node.NodeType)
            {
                case HtmlNodeType.Comment:
                    // don't output comments
                    break;
                case HtmlNodeType.Document:
                    ConvertContentTo(node, outText);
                    break;
                case HtmlNodeType.Text:
                    // script and style must not be output
                    string parentName = node.ParentNode.Name;
                    if ((parentName == "script") || (parentName == "style"))
                        break;
                    // get text
                    html = ((HtmlTextNode)node).Text;
                    // is it in fact a special closing node output as text?
                    if (HtmlNode.IsOverlappedClosingElement(html))
                        break;
                    // check the text is meaningful and not a bunch of whitespaces
                    if (html.Trim().Length > 0)
                    {
                        outText.Write(HtmlEntity.DeEntitize(html));
                    }
                    break;
                case HtmlNodeType.Element:
                    switch (node.Name)
                    {
                        case "p":
                            // treat paragraphs as crlf
                            outText.Write("\r\n");
                            break;
                        case "br":
                            outText.Write("\r\n");
                            break;
                    }
                    if (node.HasChildNodes)
                    {
                        ConvertContentTo(node, outText);
                    }
                    break;
            }
        }
    }
}
Three-step process for converting HTML to plain text
First, you need to install the NuGet package for HtmlAgilityPack. Second, create this class:
public class HtmlToText
{
    public HtmlToText()
    {
    }

    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;
            case HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }
                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}
The class above follows Judah Himango's answer.
Third, you need to create an object of the class above and use the ConvertHtml(HTMLContent) method, rather than ConvertToPlainText(string html), to convert the HTML to plain text:
HtmlToText htt = new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);
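Its limitation is that it does not collapse long runs of inline whitespace, but it is definitely portable and respects layout the way a web browser does.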
static string HtmlToPlainText(string html) {
    string buf;
    string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
        "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
        "noscript|ol|output|p|pre|section|table|tfoot|ul|video";
    string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
    buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);
    // Replace br tag to newline.
    buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);
    // (Optional) remove styles and scripts.
    buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);
    // Remove all tags.
    buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);
    // Replace HTML entities.
    buf = WebUtility.HtmlDecode(buf);
    return buf;
}
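If the whitespace limitation matters, one possible post-processing step is sketched below. It is an assumption about what "collapsing" should mean here (single spaces within a line, at most one blank line between paragraphs), and the `WhitespaceCollapser` name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; meant to run on already-stripped plain text.
public static class WhitespaceCollapser
{
    public static string Collapse(string text)
    {
        // Turn each run of spaces and tabs into a single space.
        string collapsed = Regex.Replace(text, @"[ \t]+", " ");
        // Drop the space on either side of a line break and normalize CRLF to LF.
        collapsed = Regex.Replace(collapsed, @" ?\r?\n ?", "\n");
        // Keep at most one blank line between paragraphs.
        collapsed = Regex.Replace(collapsed, @"\n{3,}", "\n\n");
        return collapsed.Trim();
    }
}
```

It could be applied to the return value of the method above, for example `WhitespaceCollapser.Collapse(HtmlToPlainText(html))`.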
I think the easiest way is to make a "string" extension method (based on what user Richard suggested):
using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
    {
        var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }
}
Then just use this extension method on any "string" variable in your program:
var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></span>";
var yourTextString = yourHtmlString.StripHTML();
I use this extension method to convert HTML-formatted comments to plain text so that they display correctly on a Crystal Report, and it works perfectly!
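Since the question only needs the first 30 to 50 characters, a companion extension method along the same lines might look like the sketch below. The names `PreviewHelpers` and `ToPlainTextPreview` and the default length of 50 are assumptions made for this example.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical companion to the extension method above; illustration only.
public static class PreviewHelpers
{
    public static string ToPlainTextPreview(this string html, int maxLength = 50)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Replace each tag with a space so that words separated only by
        // tags (such as <br/>) do not run together.
        string text = Regex.Replace(html, "<[^>]+>", " ");
        // Decode entities such as &amp; after the tags are gone.
        text = WebUtility.HtmlDecode(text);
        // Collapse all whitespace runs into single spaces.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        // Cut to the requested length and mark the truncation.
        return text.Length <= maxLength
            ? text
            : text.Substring(0, maxLength).TrimEnd() + "...";
    }
}
```

Applied to the snippet from the question, `ToPlainTextPreview()` yields `Hello World. Is there anyone out there?`. One trade-off of replacing tags with spaces is that a word split by inline tags, such as `wor<b>ld</b>`, ends up with a space inside it.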