Related questions (0)

Convert XML/HTML entities into a Unicode string in Python

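I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string containing HTML entities and returns a unicode type?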

For example, I get back:

    &#x01ce;

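This represents "ǎ", an "a" with a tone mark. Its code point is the 16-bit hexadecimal value 01ce. I want to convert the HTML entity into the value u'\u01ce'.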
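The conversion asked about here can be sketched with the standard library. This is a minimal sketch, assuming Python 3.4 or later, where `html.unescape` is available and every `str` is already Unicode (so the result is `'\u01ce'` rather than the Python 2 `u'\u01ce'`). The variable names and the second sample string are illustrative, not from the question.

```python
import html

# The numeric character reference from the question.
entity = "&#x01ce;"

# html.unescape handles numeric (&#x...; and &#...;) and named (&eacute;) references.
decoded = html.unescape(entity)

# A mixed string with named and decimal references (illustrative sample).
mixed = html.unescape("caf&eacute; &amp; t&#233;")

print(decoded == "\u01ce")
print(mixed)
```

On Python 2, which the `u'...'` literal in the question suggests, this function does not exist under that name, so the sketch would not apply as written.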

html python entities

Cri*_*ian

2010-12-16

69
votes

7
answers

60k
views

Is str.replace(..).replace(..) ad nauseam a standard idiom in Python?

For instance, say I want a function to escape a string for use in HTML (as in Django's escape filter):

    def escape(string):
        """
        Returns the given string with ampersands, quotes and angle 
        brackets encoded.
        """
        return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

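This works, but it gets ugly quickly and appears to have poor algorithmic performance (in this example, the string is traversed 5 times). Something like this would be better: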

    def escape(string):
        """
        Returns the given string with ampersands, quotes and angle 
        brackets encoded.
        """
        # Note that ampersands must be escaped first; the rest can be escaped in 
        # any order.
        return replace_multi(string.replace('&', '&amp;'),
                             {'<': '&lt;', '>': '&gt;', 
                              "'": '&#39;', '"': '&quot;'})

Does such a function exist, or is the standard Python idiom to use what I wrote before?
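One way the `replace_multi` helper imagined above could be written is with a single compiled regular expression. This is only a sketch under assumptions: `replace_multi` is the question's hypothetical name, not a standard library function, and the body below is one possible implementation. Because the string is scanned once and replaced text is never rescanned, the ampersand can go in the same mapping instead of being escaped first.

```python
import re

def replace_multi(string, replacements):
    """Apply every replacement in one left-to-right pass (hypothetical helper)."""
    if not replacements:
        return string
    # Try longer keys first so a key that is a prefix of another cannot shadow it.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # Each match is looked up in the mapping; inserted text is not matched again.
    return pattern.sub(lambda match: replacements[match.group(0)], string)

def escape(string):
    """Return the string with ampersands, quotes and angle brackets encoded."""
    return replace_multi(string, {
        '&': '&amp;', '<': '&lt;', '>': '&gt;',
        "'": '&#39;', '"': '&quot;',
    })

escaped = escape('<a href="x">Tom & Jerry\'s</a>')
print(escaped)
```

For the specific case of HTML escaping, Python 3 also ships `html.escape`, which covers the same five characters when `quote=True` (it emits `&#x27;` for the single quote rather than `&#39;`). Whether the single-pass version is actually faster than chained `str.replace` calls on typical inputs is not established here and would need measuring.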

python performance replace idioms

Mic*_*ael

2010-03-21

28
votes

6
answers

10k
views