How to convert characters like these, "a³ a¡ a´a§", to Unicode using Python?

Question

cle*_*ula 5 python string unicode urllib utf-8

I'm making a crawler to extract the text inside HTML pages, and I'm using BeautifulSoup.

When I open the URL using urllib2, the library automatically converts the HTML, which uses Portuguese accented characters like "ã ó é õ", into other characters like these: "a³ a¡ a´a§".

What I want is just to get the words without accents:

contrã¡rio -> contrario

I tried to use this algorithm, but it only works when the text contains correctly encoded words like these: "olá coração contrário"

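    import unicodedata
    def strip_accents(s):
        # NFD splits each accented letter into a base letter plus combining marks (category 'Mn'); drop the marks
        return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')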


Answer 1

shi*_*.ss 1

First, you have to make sure that your crawler returns the HTML as unicode text (for example, Scrapy has a method response.body_as_unicode() that does exactly this).
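As a rough sketch of that first step, assuming Python 3 and a page that is actually served as UTF-8: sequences such as "Ã¡" typically appear when UTF-8 bytes are decoded as Latin-1, so decoding the raw bytes with the right charset avoids the problem, and text that was already mis-decoded can often be repaired by reversing the wrong step. The helper names `decode_html` and `repair_mojibake` are hypothetical and not part of any library.

```python
def decode_html(raw_bytes, charset="utf-8"):
    """Decode raw response bytes into unicode text (the charset is an assumption)."""
    return raw_bytes.decode(charset)


def repair_mojibake(text):
    """Undo a wrong Latin-1 decode of UTF-8 bytes; return the text unchanged if that does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "contrário" written with an escape so the example does not depend on the file encoding
raw = "contr\u00e1rio".encode("utf-8")   # bytes as they would arrive from the server
garbled = raw.decode("latin-1")          # decoding with the wrong charset garbles the accent

print(decode_html(raw))
print(repair_mojibake(garbled))
```

If the server declares a different charset in its Content-Type header or in a meta tag, that value should be passed instead of the default.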

Once you have unicode text, even text you can't make sense of, the step from unicode text to equivalent ASCII text is provided by Unidecode: http://pypi.python.org/pypi/Unidecode/0.04.1

from unidecode import unidecode
# Transliterate unicode text to its closest ASCII equivalent
print(unidecode(u"\u5317\u4EB0"))


The output is "Bei Jing".
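For the Portuguese example in the question, a similar result can be sketched with the standard library alone, assuming Python 3 and UTF-8 input. Here `to_ascii` is a hypothetical helper name; unlike Unidecode, this approach only removes combining accents and silently drops any other non-ASCII character instead of transliterating it.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def to_ascii(raw_bytes, charset="utf-8"):
    """Decode the bytes first, then strip accents and drop whatever is still non-ASCII."""
    text = raw_bytes.decode(charset)
    return strip_accents(text).encode("ascii", "ignore").decode("ascii")


# "olá coração contrário", written with escapes and encoded as the server would send it
page_bytes = "ol\u00e1 cora\u00e7\u00e3o contr\u00e1rio".encode("utf-8")
print(to_ascii(page_bytes))  # prints: ola coracao contrario
```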
