How do I detect invalid HTML entities in PHP?

car*_*pii 8 php iconv html-entities

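I have a bunch of text/HTML documents that I am processing.

Some of them contain encoded HTML entities, which I am trying to convert back to the original decoded UTF-8 characters.

This is easy enough with html_entity_decode, but some of the entities are invalid, for example: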

&#x99999;

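For this reason, I am using a regular expression to extract each individual entity and then trying to validate it somehow.

If an entity is invalid, I want to leave it in the document as &#x99999;, while a properly encoded entity such as &amp; should still become &.

Here is some sample test code I knocked together: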

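<?php
// Print each entity found in $s with its decoded form, raw bytes, and detected encoding.
function dump_chars($s)
{
    // Match anything that looks like a named or numeric entity.
    if (preg_match_all('/&[#A-Za-z0-9]+;/', $s, $matches))
    {
        foreach ($matches[0] as $m)
        {
            $decoded = html_entity_decode($m, ENT_QUOTES, "UTF-8");

            // Re-encode the entity so the browser shows its source text.
            echo "[" . htmlentities($m, ENT_QUOTES, "UTF-8") . "] ";
            echo "Decoded: [" . $decoded . "] ";
            echo "Hex: [" . bin2hex($decoded) . "] ";
            echo "detect: [" . mb_detect_encoding($decoded) . "]";
            echo "<br>";
        }
    }
}

// Test payload: two named entities, two numeric entities, and the invalid &#x99999;.
$payload = "&quot; &amp; &#x349; &#x92; &#x99999;";
echo "<html><head><meta charset='UTF-8'></head><body>";
dump_chars($payload);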

I am drawing a bit of a blank on how best to validate the entities, so any help would be appreciated.

car*_*pii 2

I finally found a way:

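function decode_numeric_entities($s)
{
    $result = $s;
    // Conversion map: [start, end, offset, mask]. Numeric entities whose
    // code point falls outside 0x0..0x2FFFF are left untouched.
    $convmap = array(0x0, 0x2FFFF, 0, 0xFFFF);

    if (preg_match_all('/&[#A-Za-z0-9]+;/', $s, $matches))
    {
        foreach ($matches[0] as $m)
        {
            // Returns $m unchanged when it cannot be decoded with this map.
            $decoded = mb_decode_numericentity($m, $convmap, 'UTF-8');
            $result = str_replace($m, $decoded, $result);
        }
    }
    return $result;
}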

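Running a string through this function converts every valid numeric entity (any code point inside the conversion map's range) into its actual UTF-8 character and leaves invalid ones such as &#x99999; as entities. Note that mb_decode_numericentity only handles numeric entities, so named ones such as &amp; still need separate handling.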
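
For documents that mix named and numeric entities, one possible variation is to decide entity by entity inside preg_replace_callback. The sketch below is not from the original answer. decode_valid_entities is a hypothetical name, and it assumes that "invalid" means a numeric reference that is outside the Unicode range, a surrogate, or a currently unassigned code point (PCRE's \p{Cn}). Control characters such as &#x92; are still decoded under that assumption. Named entities are handed to html_entity_decode, which leaves unknown names unchanged. It needs the mbstring extension (mb_chr, PHP 7.2+).

```php
<?php
// Hypothetical helper: decode only entities that resolve to an assigned
// Unicode character; everything else is returned exactly as written.
function decode_valid_entities(string $s): string
{
    $pattern = '/&(?:#[xX]([0-9A-Fa-f]+)|#([0-9]+)|[A-Za-z][A-Za-z0-9]*);/';

    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[1]) && $m[1] !== '') {
            $cp = hexdec($m[1]);          // becomes a float if the value is huge
        } elseif (isset($m[2]) && $m[2] !== '') {
            $cp = (int) $m[2];            // saturates at PHP_INT_MAX if huge
        } else {
            // Named entity: unknown names come back unchanged.
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        }

        // Reject values outside Unicode and the surrogate range.
        if (!is_int($cp) || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
            return $m[0];
        }

        $char = mb_chr($cp, 'UTF-8');
        // Assumption: unassigned code points (category Cn) count as invalid.
        if ($char === false || preg_match('/^\p{Cn}/u', $char) === 1) {
            return $m[0];
        }
        return $char;
    }, $s);
}

$payload = "&quot; &amp; &#x349; &#x92; &#x99999;";
$clean = decode_valid_entities($payload);
```

Because each match is replaced exactly once, input such as &amp;quot; is decoded a single step to &quot; rather than all the way to a quotation mark. &#x99999; is kept as written because U+99999 is currently unassigned.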