PHP regex: how do I remove all HTML tags without stripping non-HTML tags?

mar*_*t15 1 php regex string

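Using a PHP regular expression, how can I remove HTML tags (both opening and closing), including ones with attributes such as <hr class="myclass" />, without removing non-HTML tags such as <dog> or <dog class="cat">?

The non-HTML tags are dynamic and cannot be hard-coded.

Input: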

<b><> <<> <dog> <123> <" !> <!--...--> <!doctype> <hr class="myclass" /> </b>

The output should be:

<> <<> <dog> <123> <" !>

I am considering HTML Purifier, but first I need to know whether this can be done with a regex.

HTML tag reference: http://www.quackit.com/html/tags/

Thanks in advance =)

rid*_*ner 8

To match (and remove) only the start and end tags of HTML 4.01 elements, the regex in this tested PHP function does a pretty good job:

function strip_HTML_tags($text)
{ // Strips HTML 4.01 start and end tags. Preserves contents.
    return preg_replace('%
        # Match an opening or closing HTML 4.01 tag.
        </?                  # Tag opening "<" delimiter.
        (?:                  # Group for HTML 4.01 tags.
          ABBR|ACRONYM|ADDRESS|APPLET|AREA|A|BASE|BASEFONT|BDO|BIG|
          BLOCKQUOTE|BODY|BR|BUTTON|B|CAPTION|CENTER|CITE|CODE|COL|
          COLGROUP|DD|DEL|DFN|DIR|DIV|DL|DT|EM|FIELDSET|FONT|FORM|
          FRAME|FRAMESET|H[1-6]|HEAD|HR|HTML|IFRAME|IMG|INPUT|INS|
          ISINDEX|I|KBD|LABEL|LEGEND|LI|LINK|MAP|MENU|META|NOFRAMES|
          NOSCRIPT|OBJECT|OL|OPTGROUP|OPTION|PARAM|PRE|P|Q|SAMP|
          SCRIPT|SELECT|SMALL|SPAN|STRIKE|STRONG|STYLE|SUB|SUP|S|
          TABLE|TD|TBODY|TEXTAREA|TFOOT|TH|THEAD|TITLE|TR|TT|U|UL|VAR
        )\b                  # End group of tag name alternative.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        /?                   # Tag may be empty (with self-closing "/>" sequence).
        >                    # Opening tag closing ">" delimiter.
        | <!--.*?-->         # Or a (non-SGML compliant) HTML comment.
        | <!DOCTYPE[^>]*>    # Or a DOCTYPE.
        %six', '', $text);
}
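To try the same approach on the question's sample input without retyping the full tag list, a reduced sketch can build the alternation from an array. The function name `strip_listed_tags` and the two-entry tag list are assumptions made for illustration and are not part of the answer's function; the attribute pattern is the same one, written without the `x` modifier.

```php
<?php
// Hypothetical helper: strips start/end tags whose names appear in $tags.
function strip_listed_tags($text, array $tags)
{
    // Quote each name so it is safe inside a "%"-delimited pattern.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '%');
    }, $tags));

    $pattern = '%</?(?:' . $names . ')\b'                       // "<" or "</" plus a listed name
        . '(?:\s+[\w\-.:]+'                                     // attribute name
        . '(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*'      // optional attribute value
        . '\s*/?>'                                              // optional slash, then the closing bracket
        . '|<!--.*?-->'                                         // or a simple HTML comment
        . '|<!DOCTYPE[^>]*>%si';                                // or a DOCTYPE

    return preg_replace($pattern, '', $text);
}

$input = '<b><> <<> <dog> <123> <" !> <!--...--> <!doctype> <hr class="myclass" /> </b>';

// Only "b" and "hr" are listed here; a real list would cover every HTML tag.
// rtrim() drops the spaces left behind where the trailing tags were removed.
echo rtrim(strip_listed_tags($input, ['b', 'hr'])), "\n";
// prints: <> <<> <dog> <123> <" !>
```

Because the tag name group is followed by `\b`, the order of names in the alternation does not matter: `<bdog>` is left alone even though `b` is listed.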

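CAVEATS: Does not remove scripts such as <? ... ?>. Any start or end tags that appear inside such structures will be removed. Does not correctly parse general SGML-compliant comments. Does not handle short tags.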
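The first two caveats interact: a script block survives, but any HTML tags inside it are stripped, which can corrupt the script. One possible workaround, sketched here as an assumption and not as part of the function above, is a pre-pass that drops such blocks entirely before stripping tags. The name `remove_script_blocks` is hypothetical, and whether deleting the blocks is acceptable depends on the input.

```php
<?php
// Hypothetical pre-pass: remove script blocks delimited by "<?" and the
// matching closing delimiter, so that tags inside them are never touched.
function remove_script_blocks($text)
{
    // "s" lets "." cross newlines; the lazy quantifier stops at the
    // first closing delimiter it meets.
    return preg_replace('%<\?.*?\?>%s', '', $text);
}

$text = 'before <?php echo "<b>hi</b>"; ?> after';
echo remove_script_blocks($text), "\n"; // prints "before  after" (two spaces)
```

This is deliberately naive: a closing delimiter that appears inside a string within the script would end the match early.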

Edit: Added matching for DOCTYPE and (non-SGML-strict) HTML comments. It now correctly passes the test data in the OP.

Edit 2: The previous version was missing the 's' single-line modifier. Also added short tags to the list of caveats.
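The effect of the 's' modifier is easiest to see on a comment that spans two lines. The following is a minimal sketch that uses only the comment alternative from the pattern; the variable names are illustrative.

```php
<?php
$html = "a<!-- first line\nsecond line -->b";

// Without "s", "." does not match a newline, so the comment is not matched
// and the string comes back unchanged.
$without = preg_replace('%<!--.*?-->%', '', $html);

// With "s" (PCRE dot-all mode), "." also matches newlines.
$with = preg_replace('%<!--.*?-->%s', '', $html);

var_dump($without === $html); // bool(true)
var_dump($with === 'ab');     // bool(true)
```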