Handling French letters in Python

Dav*_*ris 9 python string ascii python-2.7 french

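I am reading data from a file that contains both French and English letters. I am trying to build a list of every English and French letter that occurs in it (each stored as a string). I do this with the code below: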

# encoding: utf-8
import urllib2

def trackLetter(letters, line):
    # Append each item of `line` to `letters` the first time it is seen.
    for a in line:
        found = False
        for b in letters:
            if b == a:
                found = True
        if not found:
            letters += a

cur_letters = []  # for storing possible letters

# Note: the second positional argument of urlopen() is request data, not an encoding.
data = urllib2.urlopen('https://duolinguist.wordpress.com/2015/01/06/top-5000-words-in-french-wordlist/', 'utf-8')
for line in data:
    trackLetter(cur_letters, line)
    # works if I print here

print cur_letters

This code prints the following:

['t', 'h', 'e', 'o', 'f', 'a', 'n', 'd', 'i', 'r', 's', 'b', 'y', 'w', 'u', 'm', 'l', 'v', 'c', 'p', 'g', 'k', 'x', 'j', 'z', 'q', '\xc3', '\xa0', '\xaa', '\xb9', '\xa9', '\xa8', '\xb4', '\xae', '-', '\xe2', '\x80', '\x99', '\xa2', '\xa7', '\xbb', '\xaf']

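Even though I specified UTF-8 encoding, the French letters have apparently been lost in some conversion to ASCII! Strangely, when I print each line directly (at the point marked by the comment), the French characters look perfect!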

What should I do to preserve these characters (é, è, ê, etc.), or to convert them back to their original form?

Gre*_*reg 7

They are not lost; they are just escaped when you print the list.

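When you print a list in Python 2, print calls the __str__ method of the list itself, not of each individual item. The list's __str__ builds its output from the repr() of each item, and repr() of a byte string escapes non-ASCII characters. For a fuller explanation, see this excellent answer: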

How does str(list) work?
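As a rough illustration of that point, the sketch below compares the string form of a one-item list with the repr() of the item itself. The variable names are made up for this example, and the sample value is an assumed stand-in for a UTF-8 byte string read from a file. It is written to run on both Python 2.7 and Python 3 (where the repr carries an extra `b` prefix).

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# A UTF-8 encoded byte string, like the `str` values Python 2 reads from a file.
# (Hypothetical sample value, chosen only for illustration.)
e_acute = u'é'.encode('utf-8')

# repr() of a byte string escapes every non-ASCII byte.
as_item = repr(e_acute)

# str() of a list is assembled from the repr() of each element.
as_list = str([e_acute])

print(as_item)
print(as_list)
print(as_list == '[' + as_item + ']')  # True
```

The data inside the list is unchanged; only its displayed representation is escaped.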

The following snippet demonstrates the issue concisely:

# -*- coding: utf-8 -*-
char_list = ['é', 'è', 'ê']
print(char_list)
# Python 2 prints the repr() of each item:
# ['\xc3\xa9', '\xc3\xa8', '\xc3\xaa']

print(', '.join(char_list))
# é, è, ê
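The output in the question also shows '\xc3' and '\xa9' as separate entries, because iterating over a Python 2 byte string yields one byte at a time rather than one letter. A minimal sketch of one way to collect whole letters, assuming the input is UTF-8 encoded: decode each line to unicode before iterating. The helper `track_letters` and the sample lines are hypothetical stand-ins for the question's function and downloaded data.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def track_letters(seen, text):
    """Append each character of `text` to `seen` the first time it occurs."""
    for ch in text:
        if ch not in seen:
            seen.append(ch)


# Stand-in for lines read from the network: UTF-8 encoded byte strings.
raw_lines = [u'été\n'.encode('utf-8'), u'forêt\n'.encode('utf-8')]

letters = []
for raw in raw_lines:
    # Decode first, so that iteration yields whole characters, not bytes.
    track_letters(letters, raw.decode('utf-8').strip())

# The first line is 6 bytes long but only 4 characters once decoded.
print(len(raw_lines[0]), len(raw_lines[0].decode('utf-8')))  # 6 4
print(len(letters))  # 6
```

With the letters stored as unicode strings, `u', '.join(letters)` gives a readable string to print, in the same way as the join shown above.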