psy*_*cat 5 python unicode json
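I'm writing a script to export my links and their titles from Chrome to HTML.
Chrome stores its bookmarks as JSON, in UTF-8 encoding.
Some of the titles are in Russian, so they are stored like this:
"name":"\u0425\u0430\u0431\u0440\..."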
import codecs
f = codecs.open("chrome.json","r", "utf-8")
data = f.readlines()

urls = [] # for links
names = [] # for link titles

ind = 0
for i in data:
    if i.find('"url":') != -1:
        urls.append(i.split('"')[3])
        names.append(data[ind-2].split('"')[3])
    ind += 1

fw = codecs.open("chrome.html","w","utf-8")
fw.write("<html><body>\n")
for n in names:
    fw.write(n + '<br>')
    # print type(n) # this will return <type 'unicode'> for each url!
fw.write("</body></html>")
Now, in chrome.html, those titles show up as the literal text \u0425\u0430\u0431...
How can I turn them back into Russian?
I'm using Python 2.5.
>>> s = '\u041f\u0440\u0438\u0432\u0435\u0442 world!'
>>> type(s)
<type 'str'>
>>> print s.decode('raw-unicode-escape').encode('utf-8')
Привет world!
That is what I needed: converting a str that contains \u041f... escapes to unicode.
f = open("chrome.json", "r")
data = f.readlines()
f.close()

urls = [] # for links
names = [] # for link titles

ind = 0
for i in data:
    if i.find('"url":') != -1:
        urls.append(i.split('"')[3])
        names.append(data[ind-2].split('"')[3])
    ind += 1

fw = open("chrome.html","w")
fw.write("<html><body>\n")
for n in names:
    fw.write(n.decode('raw-unicode-escape').encode('utf-8') + '<br>')
fw.write("</body></html>")
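For readers on Python 3, where str no longer has a decode method, a rough equivalent of the raw-unicode-escape conversion is sketched below. It assumes the escapes arrive as plain ASCII bytes, as they would when the file is read in binary mode; the variable names are only illustrative. The second half shows a caveat worth knowing: this codec treats every byte that is not part of an escape as Latin-1, so genuine UTF-8 bytes in the same input would be mangled.

```python
# Python 3 sketch (an assumption; the thread itself uses Python 2.5).

# The escapes as they sit in the file: a backslash, a 'u' and four hex digits.
raw = b"\\u041f\\u0440\\u0438\\u0432\\u0435\\u0442 world!"

# raw-unicode-escape turns each \uXXXX sequence into the character it names.
text = raw.decode("raw-unicode-escape")
assert text == "Привет world!"

# Caveat: bytes outside an escape are read as Latin-1, so real UTF-8 breaks.
mangled = "é".encode("utf-8").decode("raw-unicode-escape")
assert mangled == "Ã©"  # two wrong characters instead of one 'é'
```

That caveat is one reason a real JSON parser tends to be the safer route for a bookmarks file.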
By the way, this is not just about Russian; non-ASCII characters are quite common in page names. Example:
name=u'Python Programming Language \u2013 Official Website'
url=u'http://www.python.org/'
As an alternative to fragile code such as
urls.append(i.split('"')[3])
names.append(data[ind-2].split('"')[3])
# (1) relies on name being 2 lines before url
# (2) fails if there is a `"` in the name
# example: "name": "The \"Fubar\" website",
you can process the input file with the json module. For Python 2.5, you can get simplejson.
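To make the difference concrete, here is a minimal sketch in Python 3 syntax (an assumption; the thread itself targets 2.5, where simplejson offers the same loads call). The sample line and the example.com URL are made up for illustration. It shows a JSON parser handling both the escaped quote and the \uXXXX escapes, while splitting on quotes cuts the name short.

```python
import json

# Hypothetical bookmark entry: escaped quotes plus \uXXXX escapes in the name.
line = r'{"name": "The \"Fubar\" website \u2013 \u0425\u0430\u0431\u0440", "url": "http://example.com/"}'

# A JSON parser unescapes \" and \uXXXX in a single step.
entry = json.loads(line)
assert entry["name"] == 'The "Fubar" website – Хабр'
assert entry["url"] == "http://example.com/"

# Splitting on quotes stops at the first escaped quote inside the name.
naive = line.split('"')[3]
assert naive == "The \\"  # just 'The ' followed by a backslash
```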
Here is a script that emulates yours:
try:
    import json
except ImportError:
    import simplejson as json
import sys

def convert_file(infname, outfname):

    def explore(folder_name, folder_info):
        for child_dict in folder_info['children']:
            ctype = child_dict.get('type')
            name = child_dict.get('name')
            if ctype == 'url':
                url = child_dict.get('url')
                # print "name=%r url=%r" % (name, url)
                fw.write(name.encode('utf-8') + '<br>\n')
            elif ctype == 'folder':
                explore(name, child_dict)
            else:
                print "*** Unexpected ctype=%r ***" % ctype

    f = open(infname, 'rb')
    bmarks = json.load(f)
    f.close()
    fw = open(outfname, 'w')
    fw.write("<html><body>\n")
    for folder_name, folder_info in bmarks['roots'].iteritems():
        explore(folder_name, folder_info)
    fw.write("</body></html>")
    fw.close()

if __name__ == "__main__":
    convert_file(sys.argv[1], sys.argv[2])
Tested with Python 2.5.4 on Windows 7 Pro.
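For anyone running Python 3, where the print statement and dict.iteritems() no longer exist, a port of the same idea might look roughly like the sketch below. It is not the tested script from this answer: the helper name iter_bookmarks is made up, the file layout (a "roots" object whose folders carry "children" lists) is assumed to match what the script above walks, and the isinstance guard is a defensive assumption in case "roots" also holds non-folder values.

```python
import json


def iter_bookmarks(node):
    """Yield (name, url) pairs from a folder node, descending into subfolders."""
    for child in node.get("children", []):
        ctype = child.get("type")
        if ctype == "url":
            yield child.get("name", ""), child.get("url", "")
        elif ctype == "folder":
            yield from iter_bookmarks(child)


def convert_file(infname, outfname):
    # json.load decodes \uXXXX escapes, so names arrive as ordinary text.
    with open(infname, encoding="utf-8") as f:
        bmarks = json.load(f)

    # Opening the output as UTF-8 text replaces the explicit .encode('utf-8').
    with open(outfname, "w", encoding="utf-8") as fw:
        fw.write("<html><body>\n")
        for root in bmarks["roots"].values():
            # Assumption: skip anything under "roots" that is not a folder dict.
            if isinstance(root, dict):
                for name, url in iter_bookmarks(root):
                    fw.write(name + "<br>\n")
        fw.write("</body></html>")
```

Calling convert_file("chrome.json", "chrome.html") would then mirror the command-line use of the script above. Like the original, it writes titles only and does not HTML-escape them.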