Python encoding UTF-8

Question

vek*_*kah 43 python unicode encoding utf-8

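I am writing some scripts in Python. I build a string that I save to a file. This string contains a lot of data, taken from the directory tree and the file names. According to convmv, my whole directory tree is in UTF-8.

I want to keep everything in UTF-8 because I will save it in MySQL afterwards. For now, in MySQL, which is in UTF-8, I have a problem with some characters (like é or è - I am French).

I want Python to always treat strings as UTF-8. I read some information on the internet and did it like this.

My script begins with this: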

 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 def createIndex():
     import codecs
     toUtf8=codecs.getencoder('UTF8')
     #lot of operations & building indexSTR the string who matter
     findex=open('config/index/music_vibration_'+date+'.index','a')
     findex.write(codecs.BOM_UTF8)
     findex.write(toUtf8(indexSTR)) #this bugs!


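When I run it, this is the response: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2171: ordinal not in range(128)

Edit: I see that the accents are written correctly in my file. After creating this file, I read it and write it into MySQL. I don't understand why, but I get encoding problems there. My MySQL database is in utf8, or so it seems: the SQL query SHOW variables LIKE 'char%' returns only utf8 or binary.

My function looks like this: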

#!/usr/bin/python
# -*- coding: utf-8 -*-

def saveIndex(index,date):
    import MySQLdb as mdb
    import codecs

    sql = mdb.connect('localhost','admin','*******','music_vibration')
    sql.charset="utf8"
    findex=open('config/index/'+index,'r')
    lines=findex.readlines()
    for line in lines:
        if line.find('#artiste') != -1:
            artiste=line.split('[:::]')
            artiste=artiste[1].replace('\n','')

            c=sql.cursor()
            c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom="'+artiste+'"')
            nbr=c.fetchone()
            if nbr[0]==0:
                c=sql.cursor()
                iArt+=1
                c.execute('INSERT INTO artistes(nom,status,path) VALUES("'+artiste+'",99,"'+artiste+'/")'.encode('utf8'))


The artists, which display correctly in the file, are written badly into the database. What is the problem?

Answer 1

Mar*_*ers 55

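You don't need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode before it can encode it back to UTF-8. That is what is failing here: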

>>> data = u'\u00c3'            # Unicode data
>>> data = data.encode('utf8')  # encoded to UTF-8
>>> data
'\xc3\x83'
>>> data.encode('utf8')         # Try to *re*-encode it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
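
The implicit step can be made explicit. The sketch below spells out what Python 2 does behind the scenes when .encode() is called on a byte string: it first decodes with the default ASCII codec, and that is the step that fails. The helper name reencode is hypothetical, and the code is written so that it also runs on Python 3, where byte strings have no .encode() method at all.

```python
def reencode(encoded_bytes):
    """Mimic Python 2's str.encode('utf8'): implicit ASCII decode, then encode."""
    # Step 1 (implicit in Python 2): bytes -> unicode using the 'ascii' codec.
    text = encoded_bytes.decode('ascii')
    # Step 2: unicode -> UTF-8 bytes. Never reached if step 1 fails.
    return text.encode('utf8')


data = u'\u00c3'.encode('utf8')  # already UTF-8: the two bytes C3 83

try:
    reencode(data)
    error = None
except UnicodeDecodeError as exc:
    error = exc  # the 'ascii' codec cannot decode byte 0xc3

# Pure-ASCII bytes survive the round trip, which is why the bug only
# shows up once accented characters appear in the data.
ascii_ok = reencode(b'abc')
```

So the error message mentions decoding even though the code only asked for encoding.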


Just write your data directly to the file; there is no need to encode data that is already encoded.

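If you instead build up unicode values, then you would indeed have to encode those to be able to write them to a file. You would want to use codecs.open() instead, which returns a file object that encodes unicode values to UTF-8 for you.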
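
The two cases above can be sketched side by side. This is a minimal illustration, not the asker's script: the file names and the sample string are made up, and a temporary directory stands in for config/index/.

```python
import codecs
import os
import tempfile

tmpdir = tempfile.mkdtemp()

# Case 1: the data is already UTF-8 encoded bytes, so write it as-is
# to a file opened in binary mode. No extra encoding step.
encoded = u'Am\u00e9lie'.encode('utf8')
bytes_path = os.path.join(tmpdir, 'already_encoded.index')
with open(bytes_path, 'wb') as f:
    f.write(encoded)

# Case 2: the data is unicode text, so let codecs.open() encode it
# to UTF-8 on write.
text = u'Am\u00e9lie'
text_path = os.path.join(tmpdir, 'unicode_text.index')
with codecs.open(text_path, 'a', 'utf8') as f:
    f.write(text)

# Read both files back as raw bytes to compare what landed on disk.
with open(bytes_path, 'rb') as f:
    first = f.read()
with open(text_path, 'rb') as f:
    second = f.read()
```

Both files contain the same bytes; the only difference is which layer performed the encoding.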

You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot read UTF-8 otherwise (such as MS Notepad).
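
For reference, the BOM is just three bytes (EF BB BF) at the start of the file. If you do end up having to deal with one, the sketch below shows how Python's utf-8-sig codec adds it on encode and strips it on decode, while plain utf-8 leaves it in the text as a U+FEFF character. The sample string is made up.

```python
import codecs

text = u'\u00e9t\u00e9'

with_bom = text.encode('utf-8-sig')  # BOM followed by the UTF-8 bytes
without_bom = text.encode('utf-8')   # UTF-8 bytes only

# The prefix added by 'utf-8-sig' is exactly codecs.BOM_UTF8.
assert with_bom == codecs.BOM_UTF8 + without_bom

# Decoding with plain 'utf-8' keeps the BOM as a U+FEFF character ...
leaky = with_bom.decode('utf-8')
# ... while 'utf-8-sig' removes it, and also accepts input without a BOM.
clean = with_bom.decode('utf-8-sig')
also_clean = without_bom.decode('utf-8-sig')
```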

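For your MySQL insert problem, you need to do two things:

1. Add charset='utf8' to your MySQLdb.connect() call.

2. Use unicode objects, not str objects, when querying or inserting, but use SQL parameters so the MySQL connector can do the right thing for you: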

artiste = artiste.decode('utf8')  # it is already UTF8, decode to unicode

c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))

# ...

c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
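
To try the same pattern without a MySQL server, here is a sketch that uses the standard-library sqlite3 module as a stand-in for MySQLdb. This is a swapped-in driver, not the asker's setup: sqlite3 uses ? placeholders where MySQLdb uses %s, the table layout is a guess based on the column names in the question, and the sample line is made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE artistes '
    '(id INTEGER PRIMARY KEY, nom TEXT, status INTEGER, path TEXT)'
)

# A raw line as it would be read from the index file: UTF-8 bytes.
raw = u"#artiste[:::]Am\u00e9lie \"L'\u00e9t\u00e9\"\n".encode('utf8')

# Decode once at the boundary, then work with unicode text only.
artiste = raw.decode('utf8').split(u'[:::]')[1].strip()

cur = conn.cursor()
cur.execute('SELECT COUNT(id) FROM artistes WHERE nom=?', (artiste,))
if not cur.fetchone()[0]:
    # The driver handles quoting and encoding, so the quotes and
    # accents inside the name need no manual escaping.
    cur.execute(
        'INSERT INTO artistes(nom, status, path) VALUES(?, 99, ?)',
        (artiste, artiste + u'/'),
    )
conn.commit()

cur.execute('SELECT nom, path FROM artistes')
rows = cur.fetchall()
```

Building the query by string concatenation, as in the question, would break on a name like this one because of the embedded quotes.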


It may actually work better if you used codecs.open() to decode the contents automatically instead:

import codecs
import MySQLdb as mdb

sql = mdb.connect('localhost','admin','*******','music_vibration', charset='utf8')
artists_inserted = 0

# codecs.open() decodes each line from UTF-8 to unicode as it is read
with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
    for line in findex:
        if u'#artiste' not in line:
            continue

        artiste=line.split(u'[:::]')[1].strip()

        # SQL parameters let the connector handle quoting and encoding
        cursor = sql.cursor()
        cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
        if not cursor.fetchone()[0]:
            cursor = sql.cursor()
            cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
            artists_inserted += 1


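You may want to read up on Unicode, UTF-8 and encodings. I can recommend the following articles:

The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

@vekah: Did you follow the instructions in [Writing UTF-8 String to MySQL with Python](http://stackoverflow.com/q/6202726)? (4 upvotes)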

Asked: 12 years, 10 months ago
Viewed: 290826 times
Last active: 8 years, 5 months ago