Reading non-ASCII characters from a text file

Question


I am using Python 2.7. I have tried many things, such as the codecs module, but nothing worked. How can I fix this?

The file myfile.txt:

sözcük


My code:

f = open('myfile.txt','r')
for line in f:
    print line
f.close()


Output:

s\xc3\xb6zc\xc3\xbck


The output is the same in Eclipse and in the command window. I am using Windows 7. When I am not reading from a file, no character causes a problem.

Answer 1

Bir*_*ash 12

import codecs

# Open the file with UTF-8 decoding
f = codecs.open("myfile.txt", "r", encoding='utf-8')
# Read the whole file into a unicode string
sfile = f.read()
f.close()

# Check the type of what was read
print type(sfile)  # <type 'unicode'>

# Encode the unicode string back to a byte string (str) to print it
print sfile.encode('utf-8')

# Check the type of the encoded string
print type(sfile.encode('utf-8'))  # <type 'str'>
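
As a complement to the codecs approach, here is a minimal sketch that uses io.open, which accepts an encoding argument on Python 2.7 and is the same function as the built-in open on Python 3. It is not part of the original answer. It assumes the file is UTF-8 encoded, which matches the bytes shown in the question: \xc3\xb6 and \xc3\xbc are the UTF-8 encodings of ö and ü. The temporary directory and sample content are only there to make the sketch self-contained.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import os
import tempfile

# Assumption: a throwaway sample file stands in for the asker's myfile.txt.
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u's\xf6zc\xfck')  # the word u'sözcük', written as escapes

# Binary mode returns the raw bytes, as in the question's output.
with io.open(path, 'rb') as f:
    raw = f.read()

# Text mode with an explicit encoding returns decoded text.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(repr(raw))   # 8 bytes: ö and ü take two bytes each in UTF-8
print(repr(text))  # 6 characters
```

Once the content is decoded text, whether it prints correctly depends on the encoding of the console, which is a separate question from reading the file.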


Answer 2

lav*_*ton 7

First, detect the file's encoding:


from chardet import detect

# `line` is a byte string (str) read from the file
encoding = lambda x: detect(x)['encoding']
print encoding(line)


Then convert it to unicode, or to a str in your default encoding:


n_line = unicode(line, encoding(line), errors='ignore')
print n_line
print n_line.encode('utf8')
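
One caveat with errors='ignore' is that bytes which cannot be decoded are silently dropped. The sketch below is not from the original answer and does not use chardet. It illustrates the data loss and shows one possible alternative: trying a short list of candidate encodings in strict mode. The helper name decode_with_fallback and the candidate list are hypothetical, and cp1254 (the Windows Turkish code page) is only an illustrative guess based on the sample word.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def decode_with_fallback(raw, candidates=('utf-8', 'cp1254')):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes `raw` without errors."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError('none of the candidate encodings matched')


utf8_bytes = b's\xc3\xb6zc\xc3\xbck'  # the bytes shown in the question
cp1254_bytes = b's\xf6zc\xfck'        # the same word in a single-byte code page

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(cp1254_bytes))

# With errors='ignore', the undecodable bytes simply vanish.
lossy = cp1254_bytes.decode('utf-8', 'ignore')
print(repr(lossy))
```

Strict decoding fails loudly on a wrong guess, so a mismatch shows up as an error instead of missing characters.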

