Removing JavaScript and CSS code blocks from an HTML page in .NET

Iev*_*ida 8 .net html c# regex

I have HTML as a string that contains JavaScript and CSS code blocks.

Something like this:

<script type="text/javascript">

  alert('hello world');

</script>

<style type="text/css">
  A:link {text-decoration: none}
  A:visited {text-decoration: none}
  A:active {text-decoration: none}
  A:hover {text-decoration: underline; color: red;}
</style>

But I don't need them. How can I remove those blocks with a regular expression?

Eli*_*ing 16

The quick 'n' dirty method would be a regex like this:

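using System.Text.RegularExpressions;

// Singleline lets '.' match newlines, so blocks spanning several lines are matched too
var regex = new Regex(
   "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)",
   RegexOptions.Singleline | RegexOptions.IgnoreCase
);

// Replace every <script>...</script> and <style>...</style> block with an empty string
string output = regex.Replace(input, "");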
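To make the regex approach easier to try out, here is a minimal sketch that wraps the same pattern in a helper and runs it on a shortened version of the markup from the question. The class name `HtmlStripper` and the method name `StripScriptAndStyle` are made up for this illustration; only `Regex` and `RegexOptions` come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is an assumption for this sketch.
public static class HtmlStripper
{
    // Same pattern as in the answer, written as a verbatim string so the
    // backslashes do not need to be doubled.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"(\<script(.+?)\</script\>)|(\<style(.+?)\</style\>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the input with every matched script/style block removed.
    public static string StripScriptAndStyle(string html)
    {
        return ScriptOrStyle.Replace(html, "");
    }

    public static void Main()
    {
        string input =
            "<p>before</p>" +
            "<script type=\"text/javascript\">\n  alert('hello world');\n</script>" +
            "<style type=\"text/css\">\n  A:link {text-decoration: none}\n</style>" +
            "<p>after</p>";

        Console.WriteLine(StripScriptAndStyle(input));
        // prints: <p>before</p><p>after</p>
    }
}
```

Keeping the `Regex` in a static field means the pattern is parsed once rather than on every call, which matters if many documents are cleaned in a loop.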

A better* (but possibly slower) option is to use HtmlAgilityPack:

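using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlInput);

var nodes = doc.DocumentNode.SelectNodes("//script|//style");

// SelectNodes returns null (not an empty collection) when nothing matches
if (nodes != null)
{
    foreach (var node in nodes)
        node.ParentNode.RemoveChild(node);
}

string htmlOutput = doc.DocumentNode.OuterHtml;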

*) For a discussion of why it is better, see this thread.
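As one concrete illustration of the fragility (chosen for this sketch, not taken from the linked thread): HTML allows whitespace before the `>` of an end tag, so `</script >` is valid markup, but the pattern above only recognises the exact text `</script>`. The class name `RegexPitfallDemo` is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the name is an assumption for this sketch.
public static class RegexPitfallDemo
{
    // The same quick-and-dirty pattern as in the answer.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"(\<script(.+?)\</script\>)|(\<style(.+?)\</style\>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string Strip(string html)
    {
        return ScriptOrStyle.Replace(html, "");
    }

    public static void Main()
    {
        // Case 1: the end tag has a space before '>', so the pattern
        // never finds a closing "</script>" and removes nothing.
        string spaced = "<p>a</p><script>alert(1)</script ><p>b</p>";
        Console.WriteLine(Strip(spaced) == spaced); // prints: True

        // Case 2: a later, exactly spelled "</script>" exists, so the lazy
        // match runs on until it and swallows the paragraph in between.
        string twoScripts =
            "<script>a()</script ><p>keep me</p><script>b()</script>";
        Console.WriteLine(Strip(twoScripts).Length); // prints: 0
    }
}
```

The pattern could be loosened to tolerate such whitespace, but each patch of this kind tends to expose another edge case, which is the usual argument for a real HTML parser.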

  • Do you know [Tony The Pony](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)? (3 upvotes)
  • @GvS: I added an example that uses HtmlAgilityPack. (3 upvotes)