Remove HTML tags from a String

Question

How can I remove the HTML tags from the following string?

<P style="MARGIN: 0cm 0cm 10pt" class=MsoNormal><SPAN style="LINE-HEIGHT: 115%; 
FONT-FAMILY: 'Verdana','sans-serif'; COLOR: #333333; FONT-SIZE: 9pt">In an 
email sent just three days before the Deepwater Horizon exploded, the onshore 
<SPAN style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> manager in charge of 
the drilling rig warned his supervisor that last-minute procedural changes were 
creating "chaos". April emails were given to government investigators by <SPAN 
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> and reviewed by The Wall 
Street Journal and are the most direct evidence yet that workers on the rig 
were unhappy with the numerous changes, and had voiced their concerns to <SPAN 
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN>’s operations managers in 
Houston. This raises further questions about whether <SPAN 
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> managers properly 
considered the consequences of changes they ordered on the rig, an issue 
investigators say contributed to the disaster.</SPAN></p><br/>

I am writing it to Aspose.PDF, but the HTML tags show up in the PDF. How can I remove them?

Answer 1

cap*_*gon 94

Warning: This does not work for all cases and should not be used to process untrusted user input.

using System.Text.RegularExpressions;
...
const string HTML_TAG_PATTERN = "<.*?>";

static string StripHTML (string inputString)
{
   return Regex.Replace 
     (inputString, HTML_TAG_PATTERN, string.Empty);
}
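
A minimal, self-contained sketch of how the snippet above might be used. The class name `TagStripper`, the `Main` wrapper, and the sample input are illustrative assumptions, not part of the original answer. One deliberate difference: the sketch passes `RegexOptions.Singleline`, because in .NET `.` does not match `\n` by default, so a tag that wraps across lines (as several `<SPAN ...>` tags in the question's text appear to) would otherwise be left in place.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagStripper
{
    // Same pattern as the answer: "<", then as few characters as possible, then ">".
    const string HtmlTagPattern = "<.*?>";

    // Singleline makes "." match '\n' too, so tags broken across lines are removed.
    public static string StripHtml(string input)
    {
        return Regex.Replace(input, HtmlTagPattern, string.Empty, RegexOptions.Singleline);
    }
}

static class Program
{
    static void Main()
    {
        // Shortened sample modelled on the question's text; note the newline inside the SPAN tag.
        string html = "to <SPAN \nstyle=\"mso-bidi-font-weight: bold\"><b>BP</b></SPAN> managers";
        Console.WriteLine(TagStripper.StripHtml(html)); // prints: to BP managers
    }
}
```

The warning at the top of the answer still applies: this only deletes text that looks like a tag and does not decode entities such as `&amp;`.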

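-1 You should not use regular expressions to parse a context-free grammar like HTML. If the HTML is supplied by some external entity, it can easily be manipulated to circumvent the regex. (11 upvotes)
`public static string StripTagsCharArray(string source) { char[] array = new char[source.Length]; int arrayIndex = 0; bool inside = false; for (int i = 0; i < source.Length; i++) { char let = source[i]; if (let == '<') { inside = true; continue; } if (let == '>') { inside = false; continue; } if (!inside) { array[arrayIndex] = let; arrayIndex++; } } return new string(array, 0, arrayIndex); }` It is about 8 times faster than Regex. (7 upvotes)
@capdragon Also, people extrapolate from the examples they see on SO. Eventually someone will read this and try to rewrite it to remove only <script> tags, and they won't realize it is particularly unsuitable for XSS prevention (because it is easily fooled). On SO, I think solutions should be written for the general audience reading them, not just for the single person who asked the question. (Otherwise, why post questions and answers publicly in the first place?) (6 upvotes)
If you want valid HTML5, how about `<p data-foo=">">Bar</script>`? But remember, some people *will* use your code to process HTML from unknown sources, and HTML is not guaranteed to be valid! If you prefaced it with "Warning: This does not work for all cases and should not be used to process untrusted user input", I would upvote your answer. I suspect you have 58 upvotes because there are 58 people on this planet (living and dead) who don't know or don't mind the test cases your solution gets wrong. (4 upvotes)
@mehaase For the most part I agree. But who said anything about parsing? He just wants to remove the tags. There must always be a fundamental distinction between truly parsing HTML with a regex and using a regex to search or match some HTML. (2 upvotes)
@capdragon There is no difference. To transform a document *correctly*, you must parse it according to the rules of the context-free grammar that governs it. (Emphasis on the word "correctly". Your example works for some test cases but is not "correct" in the general case.) A truly regular language cannot be used to parse a context-free grammar (see: Chomsky hierarchy). Run your code on this string: `<p foo=">">Bar</script>`. The result should be "Bar", but your code produces "">Bar". (2 upvotes)
@mehaase We'll have to agree to disagree, because I think you are wrong on both counts: 1) there is a difference, and `<p foo=">">Bar</script>` is not HTML. 2) This question has over 50 upvotes, so how you conclude that it does not suit a general audience and only benefits the single person who asked is beyond me. Also, I notice you haven't posted an answer to the question; perhaps you can propose a better solution? (2 upvotes)
@mehaase Fair enough. I made the change, thanks. (2 upvotes)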
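
For reference, the character-scanning method posted in the comments can be laid out as a runnable sketch and tried on the disputed string `<p foo=">">Bar</script>`. The wrapper class `TagScanner` and the `Main` method are assumptions added for illustration; the logic follows the comment. On that input both approaches return something other than "Bar", which illustrates the point that neither is a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagScanner
{
    // Copies every character that is not between '<' and '>' into a buffer.
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }
}

static class Program
{
    static void Main()
    {
        string tricky = "<p foo=\">\">Bar</script>";

        // The lazy regex stops at the '>' inside the attribute value.
        Console.WriteLine(Regex.Replace(tricky, "<.*?>", string.Empty)); // prints: ">Bar

        // The scanner also treats that '>' as the end of the tag.
        Console.WriteLine(TagScanner.StripTagsCharArray(tricky));        // prints: "Bar
    }
}
```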

Answer 2

SLa*_*aks 10

You should use the HTML Agility Pack:

HtmlDocument doc = ...
string text = doc.DocumentNode.InnerText; // HtmlAgilityPack exposes the root as DocumentNode

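I really don't understand why people give answers that use the Agility Pack, because the .InnerText of the body (as an example) does not yield a markup-free string. Plenty of people on SO pick up the Agility Pack and then wonder why they are still staring at markup and script tags. (28 upvotes)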
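
One possible way to address the concern in the comment above is to drop `script` and `style` elements before reading `InnerText`, then decode entities. This is a sketch only, assuming the HtmlAgilityPack NuGet package is referenced; the class `HtmlText`, the method `ToPlainText`, and the sample input are hypothetical names chosen for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

static class HtmlText
{
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches the XPath expression.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style");
        if (unwanted != null)
        {
            // Copy to a list first so removal cannot disturb the enumeration.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        // DeEntitize converts entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}

static class Program
{
    static void Main()
    {
        string html = "<p>Hello <b>BP</b></p><script>alert(1)</script>";
        Console.WriteLine(HtmlText.ToPlainText(html));
    }
}
```

Whitespace handling and other non-visible content (for example HTML comments) may still need attention depending on the input and the library version.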

Archived:	14 years, 10 months ago
Views:	76489
Last activity:	11 years, 9 months ago