Convert HTML entities to Unicode and vice versa

Question

Possible duplicates:

Convert XML/HTML entities into a Unicode string in Python

HTML entity codes to text

How do you convert HTML entities to Unicode and vice versa in Python?

Answer 1

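As for the "vice versa" part (which I needed myself; that is how I found this question, which did not help, and then another site that had the answer):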

u'some string'.encode('ascii', 'xmlcharrefreplace')

This returns a plain byte string in which every non-ASCII character has been converted to an XML (HTML) numeric character reference.
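
For Python 3 readers, a minimal sketch of the same technique follows. It assumes you want text rather than bytes at the end, so it decodes the result; the sample string is made up for illustration. Note that this handler only replaces non-ASCII characters and leaves `&`, `<` and `>` untouched.

```python
# Python 3: str.encode() with the 'xmlcharrefreplace' error handler
# returns bytes, so decode again if a str is needed.
text = "café ©"  # illustrative sample input

encoded = text.encode("ascii", "xmlcharrefreplace")
as_text = encoded.decode("ascii")

print(encoded)  # b'caf&#233; &#169;'
print(as_text)  # caf&#233; &#169;
```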

Answer 2

hek*_*ran 29

You need BeautifulSoup for this.

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

The BeautifulSoup API has changed. See the latest [doc](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). (2 upvotes)
This badly needs an update for Python 3. (2 upvotes)
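
As a possible answer to the Python 3 request above, here is a rough standard-library sketch of the same two helpers. It is not the answer author's code: the snake_case function names are hypothetical, and `html.escape(..., quote=False)` is assumed to be an acceptable stand-in for the old `cgi.escape` default.

```python
import html


def html_entities_to_unicode(text):
    """Decode named and numeric character references, e.g. '&amp;' becomes '&'."""
    return html.unescape(text)


def unicode_to_html_entities(text):
    """Escape &, < and >, then replace non-ASCII characters with numeric references."""
    escaped = html.escape(text, quote=False)  # quote=False mirrors cgi.escape's default
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)

print(uni)
print(htmlent)
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
```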

Answer 3

msc*_*arf 19

Update for Python 2.7 and BeautifulSoup4

Unescape - Unicode HTML to unicode with htmlparser (Python 2.7 standard library):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape - Unicode HTML to unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape - unicode to Unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'

Upvoted for showing a standard-library solution with no dependencies. (2 upvotes)
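
If named entities such as `&eacute;` are wanted without installing bs4, something along these lines might work in Python 3. It is only a sketch built on the standard `html.entities.codepoint2name` table; the helper name `escape_named` is hypothetical, and it is not claimed to match the output of bs4's `EntitySubstitution` in every case (for example, it also turns `"` into `&quot;`).

```python
from html.entities import codepoint2name


def escape_named(text):
    """Replace characters with named entities where one exists,
    and fall back to numeric references for other non-ASCII characters."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            out.append("&" + name + ";")      # e.g. U+00E9 becomes &eacute;
        elif ord(ch) > 127:
            out.append("&#%d;" % ord(ch))     # no named entity available
        else:
            out.append(ch)                    # plain ASCII stays as-is
    return "".join(out)


unescaped = "Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood"
print(escape_named(unescaped))
# Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood
```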

Answer 4

Ped*_*ito 15

For Python 3, use html.unescape():

import html
s = "&amp;"
u = html.unescape(s)
# &

Answer 5

AXO*_*AXO 11

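As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote escaping is off by default in that function, so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, however, the function does not escape single quotes ("'"). (Because of these issues, the function has been deprecated since version 3.2.)

The recommendation is to use html.escape(s) instead of cgi.escape(s). (New in version 3.2.)

html.unescape(s) was also introduced, in version 3.4.

So in Python 3.4 and later you can:

Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
And use html.unescape(text) to convert HTML entities back to their plain-text representation.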
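
The quoting behavior described above can be checked with a short sketch like the following; the sample string and variable names are made up for illustration.

```python
import html

s = 'He said "it\'s <fine>" & left'  # illustrative input with both quote styles

escaped_all = html.escape(s)               # quote=True is the default in html.escape
escaped_min = html.escape(s, quote=False)  # only &, < and > are escaped

print(escaped_all)  # He said &quot;it&#x27;s &lt;fine&gt;&quot; &amp; left
print(escaped_min)  # He said "it's &lt;fine&gt;" &amp; left

# html.unescape reverses both forms.
print(html.unescape(escaped_all) == s)  # True
```

Unlike cgi.escape, html.escape escapes single quotes as well when quote is true, which is the safer choice for text placed inside attribute values.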

Answer 6

Jan*_*lik 6

$ python3 -c "
> import html
> print(
>     html.unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python3 -c "
> import html
> print(
>     html.escape('&©—')
> )"
&amp;©—

$ python2 -c "
> from HTMLParser import HTMLParser
> print(
>     HTMLParser().unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python2 -c "
> import cgi
> print(
>     cgi.escape('&©—')
> )"
&amp;©—

HTML only strictly requires & (ampersand) and < (left angle bracket / less-than sign) to be escaped. https://html.spec.whatwg.org/multipage/parsing.html#data-state

Archived: 16 years, 10 months ago
Views: 63061
Last activity: 6 years, 3 months ago