Regular expression to remove HTML tags

Question

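I am using the following regular expression to remove HTML tags from a string. It works, except that the closing tag is left behind. If I try to remove <a href="blah">blah</a>, it leaves the </a>.

I don't really know regex syntax and fumbled my way to this pattern. Could someone with regex knowledge give me a pattern that works?

Here is my code: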

  string sPattern = @"<\/?!?(img|a)[^>]*>";
  Regex rgx = new Regex(sPattern);
  Match m = rgx.Match(sSummary);
  string sResult = "";
  if (m.Success)
   sResult = rgx.Replace(sSummary, "", 1);

I want to remove the first occurrence of the <a> and <img> tags.

Answer 1

Jar*_*Par 21

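Parsing HTML with regular expressions is fraught with pitfalls. HTML is not a regular language, so it cannot be parsed 100% correctly with a regex. This is just one of many problems you will run into. The best approach is to use an HTML/XML parser to do this for you.

Here is a link to a blog post I wrote a while back that goes into more detail about this problem.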

http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

That said, here is a solution that should fix this particular problem. It is by no means a perfect solution.

var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if ( m.Success ) { 
  // Keep only the text that follows the first <img> or <a> tag.
  sResult = m.Groups["content"].Value;
}

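For comparison, here is a minimal sketch of a related approach in JavaScript (the language used by another answer on this page). It is not a port of the C# above. It assumes the goal is to strip the first <a> or <img> tag together with its closing tag while keeping the inner text; the name removeFirstLinkOrImage is hypothetical. Like any regex approach, it breaks on nested markup and on attributes that contain ">".

```javascript
// Hypothetical helper: strips the first <a> or <img> tag, plus the matching
// closing tag when only plain text sits between them. Not a real HTML parser.
function removeFirstLinkOrImage(html) {
  // \b stops "a" from matching longer tag names such as <abbr>.
  // The optional group captures plain text followed by the closing tag.
  const pattern = /<(img|a)\b[^>]*>(?:([^<]*)<\/\1\s*>)?/i;
  // No "g" flag, so only the first match is replaced.
  return html.replace(pattern, (match, tag, content) =>
    content === undefined ? '' : content);
}

console.log(removeFirstLinkOrImage('<a href="blah">blah</a>')); // blah
console.log(JSON.stringify(removeFirstLinkOrImage('<img src="p.png"> text'))); // " text"
```

The word boundary is the main difference from the pattern in the question, which would also match tags such as <abbr> because their names start with "a".
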
Answer 2

Joh*_*ohs 17

To turn this:

'<td>mamma</td><td><strong>papa</strong></td>'

into this:

'mamma papa'

you need to replace the tags with spaces:

.replace(/<[^>]*>/g, ' ')

collapse any run of repeated whitespace into a single space:

.replace(/\s{2,}/g, ' ')

and then trim the leading and trailing whitespace with:

.trim();

This means your tag-removal function looks like this:

function removeTags(string){
  return string.replace(/<[^>]*>/g, ' ')
               .replace(/\s{2,}/g, ' ')
               .trim();
}
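
As a quick check, the three steps can be traced on the sample input from this answer. The sketch below repeats the function so that it runs on its own; the intermediate variable names are only for illustration.

```javascript
// Same function as in the answer, repeated so the snippet is self-contained.
function removeTags(string) {
  return string.replace(/<[^>]*>/g, ' ')
               .replace(/\s{2,}/g, ' ')
               .trim();
}

const input = '<td>mamma</td><td><strong>papa</strong></td>';

// Step 1: every tag becomes a single space.
const spaced = input.replace(/<[^>]*>/g, ' ');
console.log(JSON.stringify(spaced));    // " mamma   papa  "

// Step 2: runs of whitespace collapse to one space.
const collapsed = spaced.replace(/\s{2,}/g, ' ');
console.log(JSON.stringify(collapsed)); // " mamma papa "

// Step 3: trimming gives the same result as the full function.
console.log(removeTags(input));         // mamma papa
```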

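***This is trivially defeated and should not be used for any reason.*** If you "really" want to sanitize HTML, use something that actually understands HTML syntax. Try it against this input, which loads a 1px GIF and then, assuming jQuery is present, loads a script: `<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAP///wAAACwAAAAAAQABAACAkQBADs=" onload="$.getScript('evil.js');1<2>3">`. It will not remove the element correctly, even though it should. (4 upvotes)
Ah, I figured it out. I came up with: `function removeTags(string){ return string.replace(/<[^>]*>.*?(<[^>]*>)?/g, '').replace(/\s{2,}/g, '').trim(); }` (2 upvotes)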
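
The failure described in the first comment can be reproduced with a shorter input. The snippet below is only a sketch: the tricky string is a hypothetical, harmless stand-in for the payload in the comment, and removeTags is repeated from the answer so the code runs on its own.

```javascript
// Same function as in the answer, repeated so the snippet is self-contained.
function removeTags(string) {
  return string.replace(/<[^>]*>/g, ' ')
               .replace(/\s{2,}/g, ' ')
               .trim();
}

// Hypothetical input: the ">" inside the attribute value ends the match early,
// because /<[^>]*>/ stops at the first ">" it sees.
const tricky = '<img src="x.gif" onload="1<2>3">';

console.log(removeTags(tricky)); // 3">
```

The regex treats everything up to the first ">" as the tag, so the tail of the attribute value leaks into the output. An HTML parser would read the quoted attribute as a whole and would not be fooled this way.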

Answer 3

Vad*_*fan 5

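To also remove the whitespace between tags, you can combine a regular expression with trimming the start and end of the input HTML, as in the following method: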

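    // Requires: using System.Net; using System.Text.RegularExpressions;
    public static string StripHtml(string inputHTML)
    {
        // Matches a tag plus the whitespace before the next tag, or a tag on its own.
        const string HTML_MARKUP_REGEX_PATTERN = @"<[^>]+>\s+(?=<)|<[^>]+>";
        inputHTML = WebUtility.HtmlDecode(inputHTML).Trim();

        string noHTML = Regex.Replace(inputHTML, HTML_MARKUP_REGEX_PATTERN, string.Empty);

        return noHTML;
    }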

So for the following input:

      <p>     <strong>  <em><span style="text-decoration:underline;background-color:#cc6600;"></span><span style="text-decoration:underline;background-color:#cc6600;color:#663333;"><del>   test text  </del></span></em></strong></p><p><strong><span style="background-color:#999900;"> test 1 </span></strong></p><p><strong><em><span style="background-color:#333366;"> test 2 </span></em></strong></p><p><strong><em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p>

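the output will be only the text, with no whitespace between the HTML tags and no whitespace before or after the HTML: " test text test 1 test 2 test 3 ".

Note that the space before test text comes from the <del> test text </del> HTML, and the space after test 3 comes from the  test 3 </span> HTML.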
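
To see which whitespace the pattern drops and which it keeps, here is a hypothetical JavaScript port of the same regular expression. It is a sketch, not the author's code: stripHtml and the sample string are made up for illustration, and the entity decoding done by WebUtility.HtmlDecode in the C# method is left out.

```javascript
// Hypothetical port of the pattern from the C# method above.
// Entity decoding (WebUtility.HtmlDecode in the original) is omitted.
function stripHtml(inputHtml) {
  // First alternative: a tag plus the whitespace that precedes the next tag.
  // Second alternative: any remaining tag on its own.
  const pattern = /<[^>]+>\s+(?=<)|<[^>]+>/g;
  return inputHtml.trim().replace(pattern, '');
}

const sample = '  <p>  <strong> test 1 </strong></p> <p><em> test 2 </em></p>  ';
console.log(JSON.stringify(stripHtml(sample))); // " test 1  test 2 "
```

Whitespace that sits between two tags is removed, while whitespace next to text is kept as it is. That is why two spaces remain between "test 1" and "test 2" in this sample.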

Archived:	14 years, 11 months ago
Views:	42,862
Last activity:	6 years ago