Convert Latin strings to unicode in Python

Question


I am using Scrapy. I crawled some sites and stored the items from the scraped pages in a JSON file, but some of them come out in the following format:

l = ["Holding it Together",
     "Fowler RV Trip",
     "S\u00e9n\u00e9gal - Mali - Niger","H\u00eatres et \u00e9tang",
     "Coll\u00e8ge marsan","N\u00b0one",
     "Lines through the days 1 (Arabic) \u0633\u0637\u0648\u0631 \u0639\u0628\u0631 \u0627\u0644\u0623\u064a\u0627\u0645 1",
     "\u00cdndia, Tail\u00e2ndia &amp; Cingapura"]


I expect the list to contain a mix of formats, but I want to convert it and store the strings with their original names, like this:

l = ["Holding it Together",
     "Fowler RV Trip",
     "Lines through the days 1 (Arabic) ???? ??? ?????? 1 | ??? ????? ? | Blogs"         ,
     "Índia, Tailândia & Cingapura "]


Thanks in advance.

Answer 1

sch*_*mar 7

You have byte strings that contain unicode escapes. In Python 2, you can convert them to unicode with the unicode_escape codec:

>>> print "H\u00eatres et \u00e9tang".decode("unicode_escape")
Hêtres et étang
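
The session above uses Python 2 syntax. As a minimal sketch of the same idea in Python 3, assuming the scraped text is a `str` that still holds literal backslash escapes and is otherwise pure ASCII, the string can be passed through bytes first (the variable names here are illustrative only):

```python
# Raw string so the backslashes stay literal, as they would in scraped text.
raw = r"H\u00eatres et \u00e9tang"

# A Python 3 str has no .decode(), so go through ASCII bytes first.
# Assumption: the text is pure ASCII apart from the \uXXXX escapes.
decoded = raw.encode("ascii").decode("unicode_escape")

print(decoded)
```

If the text already contains real non-ASCII characters, the `encode("ascii")` step raises `UnicodeEncodeError`, so this sketch only fits the fully escaped case.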


You can encode it back to a byte string:

>>> s = "H\u00eatres et \u00e9tang".decode("unicode_escape")
>>> s.encode("latin1")
'H\xeatres et \xe9tang'
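
One caveat about the choice of codec: latin1 only covers code points up to U+00FF, so it works for the French titles but not for the Arabic one in the question's list. The Python 3 sketch below illustrates the difference under that assumption; UTF-8 is used here as an alternative that can represent every string in the list.

```python
text = "H\u00eatres et \u00e9tang"

# Both codecs can represent the accented Latin characters.
latin1_bytes = text.encode("latin1")
utf8_bytes = text.encode("utf-8")

# First Arabic word from the question's list; outside latin1's range.
arabic = "\u0633\u0637\u0648\u0631"
try:
    arabic.encode("latin1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 handles the Arabic text without error.
arabic_utf8 = arabic.encode("utf-8")
```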


You can filter out and decode the non-unicode strings like this:

for s in l: 
    if not isinstance(s, unicode): 
        print s.decode('unicode_escape')
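
Since the question says the items are stored in a JSON file, another option is to let a JSON parser resolve the escapes. The Python 3 sketch below assumes the file holds a plain JSON array of strings. `json_text` and `load_titles` are hypothetical names standing in for the real file contents and are not part of Scrapy. It also unescapes `&amp;`, because the desired output in the question shows a plain `&`.

```python
import html
import json

# Hypothetical stand-in for the text of the JSON file written by the scraper.
json_text = (
    r'["Holding it Together", "H\u00eatres et \u00e9tang", '
    r'"\u00cdndia, Tail\u00e2ndia &amp; Cingapura"]'
)


def load_titles(text):
    """Parse JSON text and unescape HTML entities in each string."""
    # json.loads converts \uXXXX escapes into real characters by itself.
    titles = json.loads(text)
    # html.unescape turns entities such as &amp; back into '&'.
    return [html.unescape(title) for title in titles]


titles = load_titles(json_text)
for title in titles:
    print(title)
```

With this approach no manual `unicode_escape` step is needed, because the `\uXXXX` sequences are part of the JSON syntax rather than damage to the data.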

