Amu*_*nak 4 html php html5 tidy
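I am using the PHP tidy extension (tidy-html) to clean up my PHP output. I know that tidy removes invalid tags and cannot even handle the HTML5 doctype, but the tag I use, <menu>, used to be part of the HTML specification. Even so, tidy changes it to <ul> anyway.
Strangely, it did not do this before. I changed the tidy configuration and it broke. I have now turned off every option that messes with tags, but that does not help.
My script is fairly long: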
$tidy_config = array(
    'char-encoding' => 'utf8',
    'output-encoding' => 'utf8',
    'output-html' => true,
    'numeric-entities' => false,
    'ascii-chars' => false,
    'doctype' => 'loose',
    'clean' => false,
    'bare' => false,
    'fix-uri' => true,
    'indent' => true,
    'indent-spaces' => 2,
    'tab-size' => 2,
    'wrap-attributes' => true,
    'wrap' => 0,
    'indent-attributes' => true,
    'join-classes' => false,
    'join-styles' => false,
    'fix-bad-comments' => true,
    'fix-backslash' => true,
    'replace-color' => false,
    'wrap-asp' => false,
    'wrap-jste' => false,
    'wrap-php' => false,
    'wrap-sections' => false,
    'drop-proprietary-attributes' => false,
    'hide-comments' => false,
    'hide-endtags' => false,
    'drop-empty-paras' => true,
    'quote-ampersand' => true,
    'quote-marks' => true,
    'quote-nbsp' => true,
    'vertical-space' => true,
    'wrap-script-literals' => false,
    'tidy-mark' => true,
    'merge-divs' => false,
    'repeated-attributes' => 'keep-last',
    'break-before-br' => false
);
// Overrides applied on top of the base configuration
$tidy_config2 = array(
    'tidy-mark' => false,
    'vertical-space' => false,
    'hide-comments' => true,
    'indent-spaces' => 0,
    'tab-size' => 1,
    'wrap-attributes' => false,
    'numeric-entities' => true,
    'ascii-chars' => true,
    'hide-endtags' => true,
    'indent' => false
);
$tidy_config = array_merge($tidy_config, $tidy_config2);
// Remember the original doctype before tidy rewrites it
$dtm = preg_match(self::doctypeMatch, $output, $dt);
$output = tidy_repair_string($output, $tidy_config, 'utf8');
// tidy screws up doctype --fixed
if ($dtm) {
    $output = preg_replace(self::doctypeMatch, $dt[0], $output);
}
// Collapse line breaks between adjacent tags
$output = preg_replace('!>[\n\r]+<!', '><', $output);
unset($tidy_config);
return $output;
Note that the real script is more complicated than this (hence the two arrays). I have just cut out the unnecessary code.
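The doctype workaround in the script relies on self::doctypeMatch, which is not shown. Below is a minimal sketch of the same idea, assuming a simple hypothetical pattern (DOCTYPE_PATTERN) and a hypothetical helper (restoreDoctype). The "tidied" string is a hand-written stand-in for tidy's output, not real tidy output.

```php
<?php
// Hypothetical pattern; the question's self::doctypeMatch is not shown.
const DOCTYPE_PATTERN = '/<!DOCTYPE[^>]*>/i';

// Put the doctype of $original back into $tidied (first occurrence only).
function restoreDoctype(string $original, string $tidied): string
{
    if (preg_match(DOCTYPE_PATTERN, $original, $m) !== 1) {
        return $tidied; // no doctype in the input, nothing to restore
    }
    $doctype = $m[0];
    // A callback avoids "$" or "\" in the doctype being read as backreferences.
    return preg_replace_callback(
        DOCTYPE_PATTERN,
        function () use ($doctype) {
            return $doctype;
        },
        $tidied,
        1
    );
}

$original = "<!DOCTYPE html>\n<html><body><menu></menu></body></html>";
// Stand-in for what a tidy run might return (assumption, written by hand).
$tidied = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n"
    . "<html><body><ul></ul></body></html>";

$restored = restoreDoctype($original, $tidied);
echo $restored, "\n";
```

Using a callback instead of passing the captured doctype as the replacement string is a small safety choice; the original preg_replace call behaves the same for ordinary doctypes.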
According to the W3C tidy-html5 fork, the correct configuration for the new tags should be:
'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog figcaption figure footer header hgroup main menu menuitem nav section source summary template track video',
'new-empty-tags' => 'command embed keygen source track wbr',
'new-inline-tags' => 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr',
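You will notice that new-blocklevel-tags defines a strange temporary tag. It is meant as a stand-in for the old, obsolete menu tag because, as @tivie mentions in his answer, you have to replace it.
Also, the audio and video tags appear in both new-blocklevel-tags and new-inline-tags, which changes the way tidy outputs the HTML. As it is: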
<video src="movie.webm">
<track kind="subtitles" label="English" src="subtitles.vtt" srclang="en"></video>
If you remove video from new-inline-tags:
<video src="movie.webm">
<track kind="subtitles" label="English" src="subtitles.vtt" srclang="en">
</video>
Removing video from new-blocklevel-tags yields:
<video src="movie.webm">
<track kind="subtitles" label="English" src="subtitles.vtt" srclang="en"></video>
Personally, I prefer audio and video to behave like block-level tags, but that is up to you.
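To see at a glance which tags are declared in both lists, the two strings can be compared directly. This is a small sketch using the lists from the configuration; tagsInBothLists is a hypothetical helper, not part of tidy.

```php
<?php
// Tag lists copied from the configuration in this answer.
$blockLevel = 'article aside audio bdi canvas details dialog figcaption figure '
    . 'footer header hgroup main menu menuitem nav section source summary '
    . 'template track video';
$inline = 'audio command datalist embed keygen mark menuitem meter output '
    . 'progress source time video wbr';

// Hypothetical helper: tags that appear in both space-separated lists,
// in the order of the first list.
function tagsInBothLists(string $first, string $second): array
{
    return array_values(
        array_intersect(explode(' ', $first), explode(' ', $second))
    );
}

$overlap = tagsInBothLists($blockLevel, $inline);
echo implode(' ', $overlap), "\n"; // prints: audio menuitem source video
```

Besides audio and video, menuitem and source are also declared twice, so the same choice between block-level and inline behaviour applies to them.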
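Also, tags.c defines command as CM_HEAD and embed as CM_IMG. Unfortunately, I do not know what these stand for, and I do not think it is possible to emulate them.
One more thing: if you do not define new-empty-tags, you get weird output: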
<video src="movie.webm">
<track kind="subtitles" label="English" src="subtitles.vtt" srclang="en">
</track>
</video>
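As far as I recall, tidy's option documentation also asks that every tag in new-empty-tags be declared as either inline or block-level; treat that as an assumption and check it against the documentation for your tidy version. A quick sketch of a sanity check for the lists above follows (undeclaredEmptyTags is a hypothetical helper, not part of tidy):

```php
<?php
// The three tag options from the configuration in this answer.
$config = array(
    'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog '
        . 'figcaption figure footer header hgroup main menu menuitem nav '
        . 'section source summary template track video',
    'new-empty-tags' => 'command embed keygen source track wbr',
    'new-inline-tags' => 'audio command datalist embed keygen mark menuitem '
        . 'meter output progress source time video wbr',
);

// Hypothetical helper: empty tags missing from both the block-level and
// the inline list.
function undeclaredEmptyTags(array $config): array
{
    $declared = array_merge(
        explode(' ', $config['new-blocklevel-tags']),
        explode(' ', $config['new-inline-tags'])
    );
    return array_values(
        array_diff(explode(' ', $config['new-empty-tags']), $declared)
    );
}

$missing = undeclaredEmptyTags($config);
echo $missing === array()
    ? "every empty tag is also declared\n"
    : 'undeclared empty tags: ' . implode(' ', $missing) . "\n";
```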
If you also want to support the WHATWG recommendations, you should add the tags:
Here is my complete method:
function Tidy5($string, $options = null, $encoding = 'utf8')
{
    if (extension_loaded('tidy') === true)
    {
        $default = array
        (
            'anchor-as-name' => false,
            'break-before-br' => true,
            'char-encoding' => $encoding,
            'decorate-inferred-ul' => false,
            'doctype' => 'omit',
            'drop-empty-paras' => false,
            'drop-font-tags' => true,
            'drop-proprietary-attributes' => false,
            'force-output' => false,
            'hide-comments' => false,
            'indent' => true,
            'indent-attributes' => false,
            'indent-spaces' => 2,
            'input-encoding' => $encoding,
            'join-styles' => false,
            'logical-emphasis' => false,
            'merge-divs' => false,
            'merge-spans' => false,
            'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog figcaption figure footer header hgroup main menu menuitem nav section source summary template track video',
            'new-empty-tags' => 'command embed keygen source track wbr',
            'new-inline-tags' => 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr',
            'newline' => 0,
            'numeric-entities' => false,
            'output-bom' => false,
            'output-encoding' => $encoding,
            'output-html' => true,
            'preserve-entities' => true,
            'quiet' => true,
            'quote-ampersand' => true,
            'quote-marks' => false,
            'repeated-attributes' => 1,
            'show-body-only' => true,
            'show-warnings' => false,
            'sort-attributes' => 1,
            'tab-size' => 4,
            'tidy-mark' => false,
            'vertical-space' => true,
            'wrap' => 0,
        );

        $doctype = $menu = null;

        if ((strncasecmp($string, '<!DOCTYPE', 9) === 0) || (strncasecmp($string, '<html', 5) === 0))
        {
            $doctype = '<!DOCTYPE html>'; $options['show-body-only'] = false;
        }

        $options = (is_array($options) === true) ? array_merge($default, $options) : $default;

        if (strpos($string, '<menu') !== false)
        {
            $menu = array
            (
                '<menu' => '<menutidy',
                '</menu' => '</menutidy',
            );
        }

        if (isset($menu) === true)
        {
            $string = str_replace(array_keys($menu), $menu, $string);
        }

        $string = tidy_repair_string($string, $options, $encoding);

        if (empty($string) !== true)
        {
            if (isset($menu) === true)
            {
                $string = str_replace($menu, array_keys($menu), $string);
            }

            if (isset($doctype) === true)
            {
                $string = $doctype . "\n" . $string;
            }

            return $string;
        }
    }

    return false;
}
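I do not consider my answer very... tidy. Using HTMLTidy with HTML5 (which it currently does not support) is more of a hackish approach. To make it work I use regular expressions to parse HTML, which by most accounts is the root of all evil, or the way of Cthulhu. If anyone knows a better way, please enlighten us, because I do not feel that parsing HTML with regex is safe. I tested it with many examples, but I am fairly sure it is not bulletproof.
The menu tag was deprecated in HTML4 and XHTML1 and replaced by ul (unordered list). However, it was redefined in HTML5 and is therefore a valid tag under the HTML5 specification. Since HTMLTidy does not support HTML5 and works from the XHTML or HTML specifications, as the OP points out, it replaces the then-deprecated menu tag with ul (or adds a ul tag), even if you explicitly tell it not to.
This function replaces the menu tag with a custom tag before parsing with tidy. Afterwards it replaces the custom tag with menu again.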
function tidyHTML5($buffer)
{
    $buffer = str_replace('<menu', '<mytag', $buffer);
    $buffer = str_replace('menu>', 'mytag>', $buffer);
    $tidy = new tidy();
    $options = array(
        'hide-comments' => true,
        'tidy-mark' => false,
        'indent' => true,
        'indent-spaces' => 4,
        'new-blocklevel-tags' => 'menu,mytag,article,header,footer,section,nav',
        'new-inline-tags' => 'video,audio,canvas,ruby,rt,rp',
        'doctype' => '<!DOCTYPE HTML>',
        //'sort-attributes' => 'alpha',
        'vertical-space' => false,
        'output-xhtml' => true,
        'wrap' => 180,
        'wrap-attributes' => false,
        'break-before-br' => false,
        'char-encoding' => 'utf8',
        'input-encoding' => 'utf8',
        'output-encoding' => 'utf8'
    );
    $tidy->parseString($buffer, $options, 'utf8');
    $tidy->cleanRepair();
    $html = '<!DOCTYPE HTML>' . PHP_EOL . $tidy->html();
    $html = str_replace('<html lang="en" xmlns="http://www.w3.org/1999/xhtml">', '<html>', $html);
    $html = str_replace('<html xmlns="http://www.w3.org/1999/xhtml">', '<html>', $html);
    //Hackish stuff starts here
    //We use regex to parse html, which is usually a bad idea
    //But currently there is no alternative to it, since tidy is not MENU TAG friendly
    preg_match_all('/\<mytag(?:[^\>]*)\>\s*\<ul>/', $html, $matches);
    foreach($matches as $m) {
        $mo = $m;
        $m = str_replace('mytag', 'menu', $m);
        $m = str_replace('<ul>', '', $m);
        $html = str_replace($mo, $m, $html);
    }
    $html = str_replace('<mytag', '<menu', $html);
    $html = str_replace('</ul></mytag>', '</menu>', $html);
    $html = str_replace('mytag>', 'menu>', $html);
    return $html;
}
Test:
header("Content-type: text/plain");
echo tidyHTML5('<menu><li>Lorem ipsum</li></menu><div></div><menu ><a href="#">lala</a><form id="jj"><button>btn</button></form></menu><menu style="color: white" id="nhecos"><li>blabla</li><li>sdfsdfsdf</li></menu>');
OUTPUT:
<!DOCTYPE HTML>
<html>
<head>
<title></title>
</head>
<body>
<menu>
<li>Lorem ipsum
</li>
</menu><menu style="color: white" id="nhecos">
<li>blabla
</li>
<li>sdfsdfsdf
</li>
</menu>
</body>
</html>
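The placeholder swap that both answers rely on can be tried on its own, without tidy. Here is a minimal sketch, assuming a hypothetical placeholder name (mytag) and two hypothetical helpers (hideMenuTags, restoreMenuTags) that are not part of either answer. Unlike the plain str_replace calls, the lookahead leaves menuitem untouched.

```php
<?php
// Hypothetical placeholder; it would still have to be declared to tidy
// (for example in new-blocklevel-tags) before the tidy run.
const PLACEHOLDER = 'mytag';

// Rename <menu ...> and </menu> to the placeholder. The lookahead requires
// whitespace, ">" or "/" after "menu", so <menuitem> is not matched.
function hideMenuTags(string $html): string
{
    return preg_replace('#<(/?)menu(?=[\s>/])#i', '<${1}' . PLACEHOLDER, $html);
}

// Rename the placeholder back to menu (always emitted in lowercase).
function restoreMenuTags(string $html): string
{
    return preg_replace('#<(/?)' . PLACEHOLDER . '(?=[\s>/])#i', '<${1}menu', $html);
}

$sample = '<menu id="a"><menuitem></menuitem><li>x</li></menu>';
$hidden = hideMenuTags($sample);
$restored = restoreMenuTags($hidden);

echo $hidden, "\n";
echo $restored, "\n";
```

This only covers the renaming step. Stripping the ul that tidy may insert inside the placeholder is a separate problem, handled in the answer above with preg_match_all.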