How do I perform HTML decoding/encoding using Python/Django?

Question

How do I perform HTML decoding/encoding using Python/Django?

I have a string that is HTML-encoded:

'''&lt;img class=&quot;size-medium wp-image-113&quot;\
 style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot;\
 src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot;\
 alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'''

I want to change that to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" />

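I want this to register as HTML so that the browser renders it as an image instead of displaying it as text.

I have found how to do this in C#, but not in Python. Can someone help me out?

Thanks.

Edit: Someone asked why my strings are stored like that. It is because I am using a web-scraping tool that "scans" a web page and pulls certain content from it. The tool (BeautifulSoup) returns the string in that format.

Related

Convert XML/HTML Entities into Unicode String in Python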

Answer 1

Dan*_*aab 112

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

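To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetry problems: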

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)
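
Why the replacement order matters can be seen with text that was escaped twice. The following is only a small sketch: `decode_amp_last` repeats the table used above, while `decode_amp_first` is a hypothetical variant written purely for contrast and is not part of the answer.

```python
def decode_amp_last(s):
    # Same replacement table as html_decode above: '&amp;' is handled last.
    table = (("'", "&#39;"), ('"', "&quot;"), (">", "&gt;"), ("<", "&lt;"), ("&", "&amp;"))
    for char, entity in table:
        s = s.replace(entity, char)
    return s


def decode_amp_first(s):
    # Hypothetical "wrong" ordering, shown only to illustrate the problem.
    table = (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"), ('"', "&quot;"), ("'", "&#39;"))
    for char, entity in table:
        s = s.replace(entity, char)
    return s


# The literal text "&lt;b&gt;" after one pass of escaping.
escaped = "&amp;lt;b&amp;gt;"

print(decode_amp_last(escaped))   # &lt;b&gt;  (one level of escaping removed)
print(decode_amp_first(escaped))  # <b>        (two levels removed by accident)
```

Decoding `&amp;` first creates new `&lt;`/`&gt;` sequences that the later replacements then decode a second time, which is presumably the symmetry problem the answer refers to.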

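This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library: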

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.0-3.3 (HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.4+:
from html import unescape
unescaped = unescape(my_string)
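
As a quick check on a current Python 3, here is a sketch that applies `html.unescape` to the string from the question. The variable names `encoded` and `decoded` are just illustrative, and the string is joined onto one line for readability.

```python
from html import escape, unescape

# The HTML-encoded string from the question.
encoded = (
    "&lt;img class=&quot;size-medium wp-image-113&quot; "
    "style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; "
    "src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; "
    "alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;"
)

decoded = unescape(encoded)
print(decoded)

# html.escape is the inverse for this input, so the round trip is lossless.
print(escape(decoded) == encoded)
```

The first `print` shows the plain `<img ... />` tag that the question asks for.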

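As a suggestion: it may make more sense to store the HTML unescaped in your database. If possible, it is worth looking into getting unescaped results back from BeautifulSoup and avoiding this process altogether.

With Django, escaping only occurs during template rendering; so to prevent escaping, you simply tell the template engine not to escape your string. To do that, use one of these options in your template: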

{{ context_var|safe }}
{% autoescape off %}
    {{ context_var }}
{% endautoescape %}

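I think escaping only occurs in Django during template rendering. Therefore there is no need for an unescape - you just tell the template engine not to escape. Either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %} (12 upvotes)
Is there no opposite of django.utils.html.escape? (4 upvotes)
@Daniel: Please change your comment to an answer so that I can vote it up! |safe was exactly what I (and I'm sure others) was looking for in answer to this question. (3 upvotes)
`html.parser.HTMLParser().unescape()` is deprecated in 3.5. Use `html.unescape()` instead. (2 upvotes)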

Answer 2

Jia*_*ang 110

Use the standard library:

HTML Escape

try:
    from html import escape  # python 3.x
except ImportError:
    from cgi import escape  # python 2.x

print(escape("<"))

HTML Unescape

try:
    from html import unescape  # python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # python 3.x (<3.4)
    except ImportError:
        from HTMLParser import HTMLParser  # python 2.x
    unescape = HTMLParser().unescape

print(unescape("&gt;"))

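I think this is the most straightforward, "batteries included", and correct answer. I don't know why people vote for those Django/Cheetah things. (12 upvotes)
As a note for 2015: HTMLParser.unescape is deprecated in py 3.4 and removed in 3.5. Use `from html import unescape` instead (3 upvotes)
Note that this does not handle special characters such as German umlauts ("Ü") (2 upvotes)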
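
On the umlaut remark: one reading is that it concerns escaping rather than unescaping, since `html.escape` only rewrites `&`, `<`, `>` and quotes and leaves non-ASCII characters untouched. The sketch below assumes Python 3 and shows one possible way to get ASCII-only output, using the `xmlcharrefreplace` error handler; the sample text is made up.

```python
from html import escape, unescape

text = "Über <b>"

# html.escape leaves the umlaut as it is.
escaped = escape(text)
print(escaped)  # Über &lt;b&gt;

# If ASCII-only output is needed, numeric character references can be
# produced while encoding.
ascii_only = escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_only)  # &#220;ber &lt;b&gt;

# html.unescape understands both numeric and named references.
print(unescape(ascii_only))   # Über <b>
print(unescape("&Uuml;ber"))  # Über
```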

Answer 3

use*_*294 80

For HTML encoding, there is cgi.escape in the standard library:

>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.
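
Note that `cgi.escape` was removed in Python 3.8; `html.escape` is the replacement, and its `quote` flag defaults to `True`. A minimal sketch of the difference, with a made-up sample string:

```python
from html import escape

sample = '<a href="x">Tom & Jerry\'s</a>'

# quote=False mirrors the old cgi.escape default: only &, < and > are replaced.
print(escape(sample, quote=False))
# &lt;a href="x"&gt;Tom &amp; Jerry's&lt;/a&gt;

# The default (quote=True) also replaces double and single quotes.
print(escape(sample))
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#x27;s&lt;/a&gt;
```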

For HTML decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.

Answer 4

vin*_*ent 20

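If the set of encoded characters is relatively restricted, use daniel's solution. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML:

http://www.crummy.com/software/BeautifulSoup/

For your question, there is an example in their documentation: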

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

For BeautifulSoup4, the equivalent is: `from bs4 import BeautifulSoup` `BeautifulSoup("Sacr&eacute; bl&#101;u!").contents[0]` (2 upvotes)

Answer 5

Col*_*son 10

In Python 3.4+:

import html

html.unescape(your_string)

Answer 6

zgo*_*oda 8

See the bottom of this page on the Python wiki; there are at least 2 options to "unescape" HTML.

Answer 7

小智 7

If anyone is looking for a simple way to do this via Django templates, you can always use a filter like this:

<html>
{{ node.description|safe }}
</html>

I had some data coming from a vendor, and everything I posted had the HTML tags actually written out on the rendered page, as if you were looking at the source.

Answer 8

dfr*_*kow 6

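Daniel's comment as an answer:

"Escaping only occurs in Django during template rendering. Therefore there is no need for an unescape - you just tell the template engine not to escape. Either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}"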

Answer 9

slo*_*ant 5

I found a nice function at: http://snippets.dzone.com/posts/show/4569

def decodeHtmlentities(string):
    import re
    entity_re = re.compile("&(#?)(\d{1,5}|\w{1,8});")

    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)

            if cp:
                return unichr(cp)
            else:
                return match.group()

    return entity_re.subn(substitute_entity, string)[0]
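
The function above targets Python 2 (`htmlentitydefs`, `unichr`). What follows is a rough Python 3 adaptation, not the original snippet: the name `decode_html_entities` and the `try`/`except` guard are assumptions added here. The guard leaves references such as `&#x27;` unchanged, because the regular expression accepts them but `int()` cannot parse them as decimal.

```python
import re
from html.entities import name2codepoint

# Same pattern as the snippet above, written as a raw string.
ENTITY_RE = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")


def decode_html_entities(text):
    """Replace decimal and named HTML entities; leave anything else unchanged."""

    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            try:
                return chr(int(ent))
            except (ValueError, OverflowError):
                # Not a usable decimal reference (e.g. hex like "x27").
                return match.group()
        cp = name2codepoint.get(ent)
        return chr(cp) if cp else match.group()

    return ENTITY_RE.sub(substitute_entity, text)


print(decode_html_entities("Sacr&eacute; bl&#101;u! &lt;p&gt; &bogus;"))
# Sacré bleu! <p> &bogus;
```

On Python 3.4+, `html.unescape` covers the same ground, including hexadecimal references, so a hand-written decoder like this is mostly of interest for understanding how the substitution works.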

Archived: 17 years, 2 months ago
Views: 165604
Last activity: 6 years, 9 months ago