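Using a PHP regular expression, how can I remove HTML tags (opening and closing), including ones with attributes such as <hr class="myclass" />, without removing non-HTML tags such as <dog> or <dog class="cat">?
The non-HTML tags are dynamic and cannot be hard-coded.
Input: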
<b><> <<> <dog> <123> <" !> <!--...--> <!doctype> <hr class="myclass" /> </b>
The output should be:
<> <<> <dog> <123> <" !>
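I am considering using HTML Purifier, but first I need to know whether this can be done with a regular expression.
HTML tag reference: http://www.quackit.com/html/tags/
Thanks in advance =)
To match (and remove) only the start and end tags of HTML 4.01 elements, the regex in this tested PHP function does a pretty good job: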
function strip_HTML_tags($text)
{ // Strips HTML 4.01 start and end tags. Preserves contents.
    return preg_replace('%
        # Match an opening or closing HTML 4.01 tag.
        </?                  # Tag opening "<" delimiter.
        (?:                  # Group for HTML 4.01 tags.
          ABBR|ACRONYM|ADDRESS|APPLET|AREA|A|BASE|BASEFONT|BDO|BIG|
          BLOCKQUOTE|BODY|BR|BUTTON|B|CAPTION|CENTER|CITE|CODE|COL|
          COLGROUP|DD|DEL|DFN|DIR|DIV|DL|DT|EM|FIELDSET|FONT|FORM|
          FRAME|FRAMESET|H\d|HEAD|HR|HTML|IFRAME|IMG|INPUT|INS|
          ISINDEX|I|KBD|LABEL|LEGEND|LI|LINK|MAP|MENU|META|NOFRAMES|
          NOSCRIPT|OBJECT|OL|OPTGROUP|OPTION|PARAM|PRE|P|Q|SAMP|
          SCRIPT|SELECT|SMALL|SPAN|STRIKE|STRONG|STYLE|SUB|SUP|S|
          TABLE|TD|TBODY|TEXTAREA|TFOOT|TH|THEAD|TITLE|TR|TT|U|UL|VAR
        )\b                  # End group of tag name alternative.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        /?                   # Tag may be empty (with self-closing "/>" sequence).
        >                    # Opening tag closing ">" delimiter.
        | <!--.*?-->         # Or a (non-SGML compliant) HTML comment.
        | <!DOCTYPE[^>]*>    # Or a DOCTYPE.
        %six', '', $text);
}
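As a rough usage sketch, the same kind of pattern can be assembled from an array of tag names and run against the test data from the question. The helper name strip_listed_tags and the shortened tag list are hypothetical and used only for illustration; the attribute handling mirrors the pattern above in condensed form.

```php
<?php
// Hypothetical helper (not part of the original answer): builds the pattern
// from a list of tag names instead of a hand-written alternation.
function strip_listed_tags(string $text, array $tagNames): string
{
    // Quote each name so it is matched literally inside the pattern.
    $names = array_map(function ($name) {
        return preg_quote($name, '%');
    }, $tagNames);

    $pattern = '%
        </?(?:' . implode('|', $names) . ')\b   # Tag opening plus a listed tag name.
        (?:                                      # Zero or more attributes.
          \s+[\w\-.:]+                           # Attribute name.
          (?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?   # Optional attribute value.
        )*
        \s*/?>                                   # Optional slash, then closing delimiter.
        | <!--.*?-->                             # Or an HTML comment.
        | <!DOCTYPE[^>]*>                        # Or a DOCTYPE.
        %six';

    // preg_replace returns null on error; fall back to the untouched input.
    return preg_replace($pattern, '', $text) ?? $text;
}

// Abbreviated tag list, for illustration only.
$html4Tags = ['b', 'hr', 'p', 'div', 'span', 'a'];

$input  = '<b><> <<> <dog> <123> <" !> <!--...--> <!doctype> <hr class="myclass" /> </b>';
$output = strip_listed_tags($input, $html4Tags);

// The spaces that separated the removed tags remain, so trim them for display.
echo rtrim($output), "\n";
```

Only the spaces that sat between the removed tags are left at the end of the result, which is why the sketch calls rtrim before printing.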
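CAVEATS: Does not remove scripts such as <? ... ?>. Will remove any start or end tags that appear inside those structures. Does not correctly parse general SGML-compliant comments. Does not handle short tags.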
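The second caveat can be seen in a small sketch. The pattern here is an abbreviated stand-in (only B and HR, with loose attribute matching), and the guarded variant using PCRE's (*SKIP)(*FAIL) verbs is an assumption about one possible workaround, not something the function above does.

```php
<?php
// Abbreviated stand-in for the full pattern: only B and HR are listed and
// attributes are matched loosely with [^>]*. Illustration only.
$tagPattern = '</?(?:B|HR)\b[^>]*>';

$input = 'x <b>y</b> <?php echo "<b>z</b>"; ?>';

// Plain stripping: tags inside the processing instruction are removed too.
$plain = preg_replace('%' . $tagPattern . '%si', '', $input);

// Guarded variant (an assumption, not in the answer): match a whole
// processing instruction first, then discard that match with (*SKIP)(*FAIL)
// so the tags inside it are never considered for removal.
$guarded = preg_replace('%<\?.*?\?>(*SKIP)(*FAIL)|' . $tagPattern . '%si', '', $input);

echo $plain, "\n";
echo $guarded, "\n";
```

The guarded form only protects <? ... ?> blocks; SGML comments and short tags would still need separate handling.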
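Edit: Added matching for DOCTYPE and (non-SGML-strict) HTML comments. It now correctly passes the test data in the OP.
Edit 2: The previous version was missing the 's' single-line modifier. Also added short tags to the list of caveats.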