hek*_*ran 62 html python html-entities
Possible duplicate:
How do I convert HTML entities to Unicode and vice versa in Python?
Isa*_*aac 93
As for the "vice versa" (which I needed myself; it led me to this question, which did not help, and then to another site that had the answer):
u'some string'.encode('ascii', 'xmlcharrefreplace')
This returns a plain string in which every non-ASCII character is converted to an XML (HTML) numeric character reference.
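On Python 3 the same call works on an ordinary str and returns bytes. A minimal sketch, with a sample string chosen only for illustration (it is not from the answer above):

```python
# Python 3: str.encode returns bytes; each non-ASCII character becomes
# a numeric character reference such as &#233;.
sample = 'Curé, 5 €'  # illustrative input, not from the original answer
encoded = sample.encode('ascii', 'xmlcharrefreplace')
print(encoded)

# Decode back to str if text rather than bytes is needed.
as_text = encoded.decode('ascii')
print(as_text)
```

Note that this only handles non-ASCII characters; &, < and > pass through untouched, so it is usually combined with an escaping step.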
hek*_*ran 29
You need to have BeautifulSoup.
from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)
print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
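The code above targets Python 2 and BeautifulSoup 3. As a rough Python 3 sketch of the same round trip using only the standard library (the snake_case function names are my own, and html.escape with quote=False is assumed to be an acceptable stand-in for the cgi.escape default):

```python
import html

def html_entities_to_unicode(text):
    """Convert HTML entities to text, e.g. '&amp;' becomes '&'."""
    return html.unescape(text)

def unicode_to_html_entities(text):
    """Escape &, <, > and turn non-ASCII characters into numeric references."""
    # quote=False leaves quotes alone, like cgi.escape did by default.
    escaped = html.escape(text, quote=False)
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)
print(uni)
print(htmlent)
```

The trailing .decode('ascii') is there because encode() returns bytes on Python 3; drop it if bytes are what you want.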
msc*_*arf 19
Update for Python 2.7 and BeautifulSoup4
Unescape - Unicode HTML to unicode with HTMLParser (Python 2.7 standard library):
>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Unescape - Unicode HTML to unicode with bs4 (BeautifulSoup4):
>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Escape - Unicode to unicode HTML with bs4 (BeautifulSoup4):
>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
Ped*_*ito 15
Use html.unescape() in Python 3:
import html
s = "&amp;"
u = html.unescape(s)
# &
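html.unescape() is not limited to named entities. A small sketch showing that named, decimal, and hexadecimal references all decode the same way (the variable names here are illustrative):

```python
import html

# Named, decimal, and hexadecimal references decode to the same character.
named = html.unescape('&eacute;')
decimal = html.unescape('&#233;')
hexadecimal = html.unescape('&#xE9;')
print(named, decimal, hexadecimal)

# Text without any references passes through unchanged.
plain = html.unescape('no entities here')
print(plain)
```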
AXO*_*AXO 11
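As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote escaping is off by default in that function, so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, the function does not escape single quotes ("'"). (Because of these issues the function has been deprecated since version 3.2.)
It has been suggested to use html.escape(s) instead of cgi.escape(s). (New in version 3.2.)
Also, html.unescape(s) was introduced in version 3.4.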
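To illustrate the quoting difference: html.escape() has quote=True as its default and also covers single quotes. A short sketch, using a made-up sample string:

```python
import html

s = '''<a href="x">it's</a>'''  # illustrative input

# Default (quote=True): both double and single quotes are escaped.
with_quotes = html.escape(s)
print(with_quotes)

# quote=False: only &, < and > are escaped.
without_quotes = html.escape(s, quote=False)
print(without_quotes)
```

Leaving quote at its default is the safer choice when the result may end up inside an attribute value.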
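So in Python 3.4 you can:
Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
Use html.unescape(text) to convert HTML entities back to their plain-text representation.
$ python3 -c "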
> import html
> print(
>     html.unescape('&amp;&#169;&#x2014;')
> )"
&©—
$ python3 -c "
> import html
> print(
>     html.escape('&©—')
> )"
&amp;©—
$ python2 -c "
> from HTMLParser import HTMLParser
> print(
>     HTMLParser().unescape('&amp;&#169;&#x2014;')
> )"
&©—
$ python2 -c "
> import cgi
> print(
>     cgi.escape('&©—')
> )"
&amp;©—
HTML only strictly requires & (ampersand) and < (left angle bracket / less-than sign) to be escaped. https://html.spec.whatwg.org/multipage/parsing.html#data-state
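Taking that at face value, a minimal escaper for text content could look like the sketch below. escape_minimal is a hypothetical helper, not a library function, and it is assumed here that the output goes into element text rather than an attribute value, where quotes would also matter:

```python
import html

def escape_minimal(text):
    """Escape only '&' and '<' (hypothetical helper, text content only)."""
    # '&' must be replaced first so the '&' in '&lt;' is not escaped again.
    return text.replace('&', '&amp;').replace('<', '&lt;')

sample = 'a < b && c > d'  # illustrative input
escaped = escape_minimal(sample)
print(escaped)

# html.unescape() restores the original text.
restored = html.unescape(escaped)
print(restored)
```

In practice html.escape() is usually the simpler choice; the sketch only shows how little is strictly needed.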