How to escape HTML

Question

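How to escape HTML

I have a string containing HTML text. I need to escape the text content of the string but not the tags. For example, I have a string containing: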

<ul class="main_nav">
  <li>
    <a class="className1" id="idValue1" tabindex="2">Test & Sample</a>
  </li>
  <li>
    <a class="className2" id="idValue2" tabindex="2">Test & Sample2</a>
  </li>
</ul>

How can I escape only the text, to get:

<ul class="main_nav">
  <li>
    <a class="className1" id="idValue1" tabindex="2">Test &amp; Sample</a>
  </li>
  <li>
    <a class="className2" id="idValue2" tabindex="2">Test &amp; Sample2</a>
  </li>
</ul>

without modifying the tags?

Can this be handled with the HTML DOM and JavaScript?

Thanks

Answer 1

Vit*_*.us 17

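I'm surprised nobody answered this. You can have the browser itself do the escaping for you. No regular expression is better or safer than letting the browser do what it does best: handle HTML.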

function escapeHTML(str){
    var p = document.createElement("p");
    p.appendChild(document.createTextNode(str));
    return p.innerHTML;
}

Or a shorter alternative using the Option() constructor:

function escapeHTML(str){
    return new Option(str).innerHTML;
}

Note: this escapes `&<>` but not `"`, so it is not suitable for escaping attribute values. (4 upvotes)
Why create a text node when you can just do "var p = document.createElement('p'); p.innerText = str; return p.innerHTML;"? (2 upvotes)
Why use the "<option>" element specifically? (2 upvotes)
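
Following up on the first comment above: if the result also has to be safe inside quoted attribute values, plain string replacement is one option. This is only a minimal sketch and not part of the original answer; the name escapeHTMLAttr is made up for illustration, and it assumes attribute values are always quoted.

```javascript
// Hypothetical helper (not from the original answer): escapes the five
// characters that matter in text content and in quoted attribute values.
function escapeHTMLAttr(str) {
    return String(str)
        .replace(/&/g, "&amp;")   // must run first so the entities added below are not re-escaped
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
}

console.log(escapeHTMLAttr('Test & "Sample" <b>'));
// Test &amp; &quot;Sample&quot; &lt;b&gt;
```

Unlike the DOM-based versions, this one needs no document object, so it should also work outside a browser.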

Answer 2

T.J*_*der 10

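(See further down for an answer to the question as updated by the OP's comments below.)

Can this be handled with the HTML DOM and JavaScript?

No. Once the text is in the DOM, the concept of "escaping" it no longer applies. HTML source text needs to be escaped so that it is parsed into the DOM correctly; once it's in the DOM, it isn't escaped.

This can be a bit hard to grasp, so let's use an example. Here's some HTML source text (such as in an HTML file you would view with your browser):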

<div>This &amp; That</div>

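Once the browser has parsed that into the DOM, the text within the div is This & That, because the &amp; has been interpreted at that point.

So you need to catch this earlier, before the browser parses the text into the DOM. You can't handle it after the fact; by then it's too late.

Separately, the string you're starting with is invalid if it contains things like <div>This & That</div>. Pre-processing that invalid string will be tricky. You can't just use the built-in features of your environment (PHP or whatever you're using server-side), because they will escape the tags, too. You'll need to do text processing, extracting only the parts you want to process and then running those parts through an escaping process. That process will be tricky. An & followed by a space is easy enough, but if there are already-escaped entities in the source text, how do you know whether to escape them or not? Do you assume that if the string contains &amp;, you leave it alone? Or do you turn it into &amp;amp;? (Which is perfectly valid; it's how you show the actual string &amp; in an HTML page.)

What you really need to do is correct the underlying problem: the thing that is creating these invalid, half-encoded strings.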
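
If fixing the source really isn't an option, the text processing described above might look roughly like the following. Treat it as a sketch under loud assumptions: the name escapeTextOutsideTags is made up, tags are found with a naive regular expression (so a ">" inside an attribute value, comments, and script blocks will all confuse it), and anything that already looks like an entity is assumed to be intentional and left alone.

```javascript
// Hypothetical helper: escapes bare ampersands in the text between tags.
function escapeTextOutsideTags(html) {
    // Splitting on a capturing group keeps the tags in the result, so the
    // array alternates: text, tag, text, tag, ..., text.
    var parts = html.split(/(<[^>]*>)/);
    for (var i = 0; i < parts.length; i += 2) {
        // Even indexes are text. Escape "&" unless it already starts
        // something shaped like a named or numeric entity.
        parts[i] = parts[i].replace(
            /&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/g,
            "&amp;"
        );
    }
    return parts.join("");
}

console.log(escapeTextOutsideTags('<a class="className1">Test & Sample</a>'));
// <a class="className1">Test &amp; Sample</a>
```

Note that this makes exactly the guess discussed above: an existing &amp; in the input is left as-is rather than turned into &amp;amp;.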

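Edit: From the comment stream below, the question is quite different from what it seemed in your example (that's not meant as criticism). To recap for those coming to this fresh: you said you were getting these strings from WebKit's innerHTML, and I said that was odd, since innerHTML should encode & correctly (and pointed you at a couple of test pages suggesting that it did). Your reply was:

This works for &. But the same test page does not work for entities like ©, ®, « and so on.

That changes the nature of the question. You want to make entities out of characters that are perfectly valid when used literally (provided your text encoding is correct), but that can be expressed as entities instead and so made more resilient to changes in text encoding.

We can do that. According to the specification, the character values in a JavaScript string are UTF-16 (using Unicode Normalized Form C), and any conversion from the source character encoding (ISO 8859-1, Windows-1252, UTF-8, whatever) is performed before the JavaScript runtime sees it. (If you're not 100% sure you know what I mean by character encoding, it's well worth stopping now, going off and reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, and then coming back.) So that's the input side. On the output side, HTML entities identify Unicode code points. So we can reliably convert from JavaScript strings to HTML entities.

The devil is in the details, though, as always. JavaScript explicitly assumes that each 16-bit value is a character (see section 8.4 of the specification), even though UTF-16 doesn't actually work that way: a single 16-bit value might be a "surrogate" (such as 0xD800) that only makes sense when combined with the next value, meaning that two "characters" in the JavaScript string are actually one character. This isn't uncommon for Far Eastern languages.

So a robust conversion that starts with a JavaScript string and ends with HTML entities can't assume that a JavaScript "character" actually equals a character in the text; it has to handle surrogates. Fortunately, doing so is easy, because the smart people who defined Unicode made it easy: the first surrogate value is always in the range 0xD800-0xDBFF (inclusive), and the second surrogate value is always in the range 0xDC00-0xDFFF (inclusive). So any time you see a pair of "characters" in a JavaScript string that match those ranges, you're dealing with a single character defined by a surrogate pair. The formulas for converting a pair of surrogate values to a code point value are given in the links above, albeit fairly obtusely.

Armed with all this information, we can write a function that takes a JavaScript string and searches for characters (real characters, which may be one or two "characters" long) that you might want to turn into entities, replacing them with named entities from a map, or with numeric entities if we don't have them in our named map: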
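
To make the surrogate arithmetic concrete, here is a small sketch of the conversion using the ranges given above. The function names are made up for illustration.

```javascript
// Hypothetical helpers illustrating the surrogate ranges and formula.

// True when the two 16-bit values form a valid surrogate pair.
function isSurrogatePair(high, low) {
    return high >= 0xD800 && high <= 0xDBFF &&
           low  >= 0xDC00 && low  <= 0xDFFF;
}

// Combine a high and a low surrogate into a single Unicode code point.
function codePointFromSurrogates(high, low) {
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

var s = "\uD83D\uDE00"; // one character stored as two 16-bit values
console.log(isSurrogatePair(s.charCodeAt(0), s.charCodeAt(1)));                       // true
console.log(codePointFromSurrogates(s.charCodeAt(0), s.charCodeAt(1)).toString(16)); // 1f600
```

Each surrogate contributes 10 bits, which is where the multiplication by 0x400 comes from; the 0x10000 offset is there because surrogate pairs only encode code points above the 16-bit range.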

// A map of the entities we want to handle.
// The numbers on the left are the Unicode code point values; their
// matching named entity strings are on the right.
var entityMap = {
    "160": "&nbsp;",
    "161": "&iexcl;",
    "162": "&cent;",
    "163": "&pound;",
    "164": "&curren;",
    "165": "&yen;",
    "166": "&brvbar;",
    "167": "&sect;",
    "168": "&uml;",
    "169": "&copy;",
    // ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
    "8364": "&euro;"    // Last one must not have a comma after it, IE doesn't like trailing commas
};

// The function to do the work.
// Accepts a string, returns a string with replacements made.
function prepEntities(str) {
    // The regular expression below uses an alternation to look for a surrogate pair _or_
    // a single character that we might want to make an entity out of. The first part of the
    // alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
    // alone, it searches for the surrogates. The second part of the alternation you can
    // adjust as you see fit, depending on how conservative you want to be. The example
    // below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
    // character with a value from 0 to 31 ("control characters") or above 127 -- e.g., if
    // it's not "printable ASCII" (in the old parlance), convert it. That's probably
    // overkill, but you said you wanted to make entities out of things, so... :-)
    return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
        var high, low, charValue, rep;

        // Get the character value, handling surrogate pairs
        if (match.length == 2) {
            // It's a surrogate pair, calculate the Unicode code point
            high = match.charCodeAt(0) - 0xD800;
            low  = match.charCodeAt(1) - 0xDC00;
            charValue = (high * 0x400) + low + 0x10000;
        }
        else {
            // Not a surrogate pair, the value *is* the Unicode code point
            charValue = match.charCodeAt(0);
        }

        // See if we have a mapping for it
        rep = entityMap[charValue];
        if (!rep) {
            // No mapping, so fall back to a decimal numeric entity
            rep = "&#" + charValue + ";";
        }

        // Return replacement
        return rep;
    });
}

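You should be fine passing all of your HTML through it, since if these characters appear in attribute values, you almost certainly want to encode them there as well.

I have not used the above in production (I actually wrote it for this answer, because the problem intrigued me), and it comes with absolutely no warranty of any kind. I have tried to make sure it handles surrogate pairs, because that is necessary for Far Eastern languages, and supporting them is something we should all be doing now that the world has gotten smaller.

Full example page: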
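
For comparison, on engines that support ES2015 the same idea can be written more compactly, because the regular expression "u" flag and String.prototype.codePointAt handle surrogate pairs for you. This is a sketch and an alternative to the code above, not a replacement for it; the names prepEntitiesModern and namedEntities are made up, the map is deliberately tiny, and it will not run in the older browsers the original code targets.

```javascript
// Hypothetical, deliberately tiny map; extend it as needed.
var namedEntities = { 160: "&nbsp;", 169: "&copy;", 8364: "&euro;" };

function prepEntitiesModern(str) {
    // With the "u" flag the character class works on whole code points,
    // so a surrogate pair arrives in the callback as a single match.
    return str.replace(/[\u0000-\u001f\u0080-\u{10FFFF}]/gu, function (ch) {
        var codePoint = ch.codePointAt(0);
        // Use the named entity if we have one, else a decimal numeric entity
        return namedEntities[codePoint] || "&#" + codePoint + ";";
    });
}

console.log(prepEntitiesModern("\u00A9 2010, 5\u20AC \uD800\uDC00"));
// &copy; 2010, 5&euro; &#65536;
```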

<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Test Page</title>
<style type='text/css'>
body {
    font-family: sans-serif;
}
#log p {
    margin:     0;
    padding:    0;
}
</style>
<script type='text/javascript'>

// Make the function available as a global, but define it within a scoping
// function so we can have data (the `entityMap`) that only it has access to
var prepEntities = (function() {

    // A map of the entities we want to handle.
    // The numbers on the left are the Unicode code point values; their
    // matching named entity strings are on the right.
    var entityMap = {
        "160": "&nbsp;",
        "161": "&iexcl;",
        "162": "&cent;",
        "163": "&pound;",
        "164": "&curren;",
        "165": "&yen;",
        "166": "&brvbar;",
        "167": "&sect;",
        "168": "&uml;",
        "169": "&copy;",
        // ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
        "8364": "&euro;"    // Last one must not have a comma after it, IE doesn't like trailing commas
    };

    // The function to do the work.
    // Accepts a string, returns a string with replacements made.
    function prepEntities(str) {
        // The regular expression below uses an alternation to look for a surrogate pair _or_
        // a single character that we might want to make an entity out of. The first part of the
        // alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
        // alone, it searches for the surrogates. The second part of the alternation you can
        // adjust as you see fit, depending on how conservative you want to be. The example
        // below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
        // character with a value from 0 to 31 ("control characters") or above 127 -- e.g., if
        // it's not "printable ASCII" (in the old parlance), convert it. That's probably
        // overkill, but you said you wanted to make entities out of things, so... :-)
        return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
            var high, low, charValue, rep;

            // Get the character value, handling surrogate pairs
            if (match.length == 2) {
                // It's a surrogate pair, calculate the Unicode code point
                high = match.charCodeAt(0) - 0xD800;
                low  = match.charCodeAt(1) - 0xDC00;
                charValue = (high * 0x400) + low + 0x10000;
            }
            else {
                // Not a surrogate pair, the value *is* the Unicode code point
                charValue = match.charCodeAt(0);
            }

            // See if we have a mapping for it
            rep = entityMap[charValue];
            if (!rep) {
                // No mapping, so fall back to a decimal numeric entity
                rep = "&#" + charValue + ";";
            }

            // Return replacement
            return rep;
        });
    }

    // Return the function reference out of the scoping function to publish it
    return prepEntities;
})();

function go() {
    var d = document.getElementById('d1');
    var s = d.innerHTML;
    alert("Before: " + s);
    s = prepEntities(s);
    alert("After: " + s);
}

</script>
</head>
<body>
<div id='d1'>Copyright: &copy; Yen: &yen; Cedilla: &cedil; Surrogate pair: &#65536;</div>
<input type='button' id='btnGo' value='Go' onclick="return go();">
</body>
</html>

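There, I've included the cedilla as an example of conversion to a numeric entity rather than a named one (since I left cedil out of my very small example map). Note that the surrogate pair at the end shows up in the first alert as two "characters" because of the way JavaScript handles UTF-16.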
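
The "two characters" effect is easy to see without the test page. A small sketch follows; countCodePoints is a made-up helper name.

```javascript
var s = "A\uD800\uDC00"; // "A" followed by U+10000, the same character as &#65536;

// length counts 16-bit values, not real characters
console.log(s.length); // 3

// Hypothetical helper: count real characters by collapsing each
// surrogate pair into a single placeholder before measuring.
function countCodePoints(str) {
    return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, "_").length;
}

console.log(countCodePoints(s)); // 2
```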

Archived:	15 years, 4 months ago
Views:	6654
Last activity:	6 years ago