Isa*_*aac 93

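As for the "vice versa" direction (which I needed myself; it led me to this question, which didn't help, and then to another site that had the answer):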

u'some string'.encode('ascii', 'xmlcharrefreplace')

This returns a plain string in which every non-ASCII character has been converted to an XML (HTML) numeric character reference.
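For readers on Python 3, here is a minimal sketch of the same call. The sample string is just an illustration. Note that `encode` returns `bytes` in Python 3, and that this error handler only touches non-ASCII characters, so `&`, `<` and `>` are left alone.

```python
# Python 3: str.encode returns bytes, so decode again if you need text.
text = 'café €5'  # illustrative sample, not from the original answer

encoded = text.encode('ascii', 'xmlcharrefreplace')
print(encoded)                  # b'caf&#233; &#8364;5'
print(encoded.decode('ascii'))  # caf&#233; &#8364;5
```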


hek*_*ran 29

You need to have BeautifulSoup.

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

  • The BeautifulSoup API has changed. See the latest [doc](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). (2 upvotes)
  • This badly needs a Python 3 update. (2 upvotes)
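On the request for a Python 3 update: one possible port, as a sketch only, swaps BeautifulSoup and `cgi.escape` for the standard-library `html` module (Python 3.4+). The snake_case function names are my own, and `quote=False` is passed to mirror the default behaviour of `cgi.escape`.

```python
import html

def html_entities_to_unicode(text):
    """Convert HTML entities to text. For example '&amp;' becomes '&'."""
    return html.unescape(text)

def unicode_to_html_entities(text):
    """Convert text to ASCII-only HTML. For example '&' becomes '&amp;'."""
    # Escape &, < and > first, then turn non-ASCII characters into
    # numeric character references and decode back to str.
    escaped = html.escape(text, quote=False)
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)

print(uni)
print(htmlent)
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
```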

msc*_*arf 19

Update for Python 2.7 and BeautifulSoup4

Unescape - Unicode HTML to unicode with HTMLParser (Python 2.7 standard library):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape - Unicode HTML to unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape - unicode to Unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'

  • Upvoted for showing a standard-library solution with no dependencies. (2 upvotes)

Ped*_*ito 15

In Python 3, use html.unescape():

import html
s = "&amp;"
u = html.unescape(s)
# &


AXO*_*AXO 11

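As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote encoding is off by default in that function, so it may be a good idea to pass the quote=True keyword argument alongside your string. Even with quote=True, though, the function does not escape single quotes ("'"). (Because of these issues, the function has been deprecated since version 3.2.)

The suggested replacement for cgi.escape(s) is html.escape(s). (New in version 3.2.)

html.unescape(s) was introduced in version 3.4.

So in Python 3.4 and later you can: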

  • Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
  • Use html.unescape(text) to convert HTML entities back to their plain-text representation.
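A short sketch of the two calls above, including the quote handling: `html.escape` escapes both double and single quotes by default (`quote=True`). The sample string is made up for illustration.

```python
import html

s = 'Tom & "Jerry\'s" café'  # illustrative sample

# Default quote=True escapes & < > and both kinds of quotes.
print(html.escape(s))               # Tom &amp; &quot;Jerry&#x27;s&quot; café
# quote=False leaves quotes untouched.
print(html.escape(s, quote=False))  # Tom &amp; "Jerry's" café

# Add xmlcharrefreplace to get ASCII-only output.
ascii_safe = html.escape(s).encode('ascii', 'xmlcharrefreplace').decode()
print(ascii_safe)                   # Tom &amp; &quot;Jerry&#x27;s&quot; caf&#233;

# html.unescape reverses all of it.
round_trip = html.unescape(ascii_safe)
print(round_trip == s)              # True
```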


Jan*_*lik 6

$ python3 -c "
> import html
> print(
>     html.unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python3 -c "
> import html
> print(
>     html.escape('&©—')
> )"
&amp;©—

$ python2 -c "
> from HTMLParser import HTMLParser
> print(
>     HTMLParser().unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python2 -c "
> import cgi
> print(
>     cgi.escape('&©—')
> )"
&amp;©—

HTML only strictly requires & (ampersand) and < (left angle bracket / less-than sign) to be escaped. https://html.spec.whatwg.org/multipage/parsing.html#data-state
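To illustrate, here is a minimal sketch that escapes only those two characters. `minimal_escape` is a hypothetical helper, not a standard-library function, and it assumes the text ends up in element content; inside a quoted attribute value the quote character would need escaping as well, so html.escape() is likely the safer default in practice.

```python
import html

def minimal_escape(text):
    """Escape only '&' and '<' (hypothetical helper, element content only)."""
    # Replace '&' first so the '&' in '&lt;' is not escaped a second time.
    return text.replace('&', '&amp;').replace('<', '&lt;')

s = 'a < b && c > d'
escaped = minimal_escape(s)
print(escaped)                      # a &lt; b &amp;&amp; c > d
print(html.unescape(escaped) == s)  # True
```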