Why can't I see Hebrew characters when I print text from a UTF-8 file in Python?

Question

I am trying to read Hebrew text from a text file:

def task1():
    f = open('C:\\Users\\royi\\Desktop\\final project\\corpus-haaretz.txt', 'r',"utf-8")
    print 'success'
    return f

a = task1()

When I read it, this is what I get:

'[\xee\xe0\xee\xf8 \xee\xf2\xf8\xeb\xfa \xf9\xec \xe4\xf0\xe9\xe5-\xe9\xe5\xf8\xf7 \xe8\xe9\xe9\xee\xf1: \xf2\xec \xe1\xe9\xfa \xe4\xee\xf9\xf4\xe8 \xec\xe1\xe8\xec \xe0\xfa \xe7\xe5\xf7 \xe4\xe7\xf8\xed, \xec\xe8\xe5\xe1\xfa \xe9\xf9\xf8\xe0\xec \xee\xe0\xfa \xf0\xe9\xe5

And a lot more like it.

How can I see the actual text?

Answer 1

e-s*_*tis 5

Print it like this:

# assumes task1() returns a file opened with codecs.open(..., encoding=...)
print task1().read().encode('your terminal encoding here')

You must make sure your terminal can display Hebrew characters. For example, on a fully UTF-8 Linux distribution with a Hebrew locale installed:

print task1().read().encode('utf-8')

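Be careful with open:

With Python 2.7, the built-in open has no encoding parameter. Use the codecs module instead.
With Python 3+, encoding is the fourth parameter, not the third as in your call. You probably meant open(path, 'r', encoding='utf-8'). You can even omit the 'r'.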
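
As a minimal sketch of the point above, the snippet below passes the encoding by keyword so it cannot land in the wrong positional slot. It uses io.open, which accepts an encoding keyword on both Python 2.6+ and Python 3; the helper name read_text and the temporary demo file are hypothetical and only there to make the example self-contained.

```python
import io
import os
import tempfile


def read_text(path, encoding='utf-8'):
    """Return the file's contents decoded to text (hypothetical helper)."""
    # Passing encoding by keyword avoids the positional mix-up with buffering.
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()


# Demo with a throwaway file so the example does not depend on any real path.
sample = u'\u05e9\u05dc\u05d5\u05dd'  # four Hebrew letters ("shalom")
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-8') as f:
    f.write(sample)

text = read_text(demo_path)
os.remove(demo_path)
```

Here text is a text object (unicode on Python 2, str on Python 3), not raw bytes.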

So why do you need encode?

Well, when you read a file and tell Python its encoding, you get back a unicode object, not a str object. For example, on my system:

>>> import codecs
>>> content = codecs.open('/etc/fstab', encoding='utf-8').read()
>>> type(content)
<type 'unicode'>
>>> type('')
<type 'str'>
>>> type(u'')
<type 'unicode'>

If you want to turn it into a printable string (when it contains non-ASCII characters), you need to encode it back to a str:

>>> type(content.encode('utf-8'))
<type 'str'>

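We use encode because we start from a more or less generic text object (unicode is as generic as it gets for text manipulation) and convert (encode) it into one specific representation (utf-8).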
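
To illustrate "one text object, several representations", here is a small sketch that encodes the same four Hebrew letters with two codecs. The choice of cp1255 (Windows Hebrew) as the second codec is an assumption made for illustration; the answer itself only discusses utf-8.

```python
text = u'\u05de\u05d0\u05de\u05e8'  # four Hebrew letters as one abstract text object

utf8_bytes = text.encode('utf-8')      # two bytes per Hebrew letter
cp1255_bytes = text.encode('cp1255')   # one byte per Hebrew letter

# Decoding each byte string with its matching codec recovers the same text.
round_trip_ok = (utf8_bytes.decode('utf-8') == text
                 and cp1255_bytes.decode('cp1255') == text)
```

The cp1255 bytes here are \xee\xe0\xee\xf8, the same four bytes that open the output in the question. That suggests, though the answer does not say so, that the file may be stored in a single-byte Hebrew encoding rather than UTF-8, in which case that encoding is the one to pass when opening it.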

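We need this specific representation because your system knows nothing about Python's internals, and if you don't specify an encoding, only ASCII characters can be printed. So when you output text, you explicitly encode it to an encoding your system understands. For me, luckily, that is 'utf-8', so it's easy. If you are on Windows, it may get trickier.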
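
One way to avoid hard-coding the terminal encoding is to ask Python what it detected. The sketch below assumes sys.stdout.encoding reflects what the terminal accepts, which may not hold everywhere (notably on older Windows consoles); encode_for_terminal is a hypothetical helper name.

```python
import sys


def encode_for_terminal(text, fallback='utf-8'):
    """Encode text with the detected stdout encoding (hypothetical helper)."""
    # sys.stdout.encoding can be None, e.g. when output is piped in Python 2.
    encoding = getattr(sys.stdout, 'encoding', None) or fallback
    # 'replace' substitutes '?' for characters the target encoding lacks,
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, 'replace')


# What an ASCII-only target makes of four Hebrew letters:
ascii_safe = u'\u05e9\u05dc\u05d5\u05dd'.encode('ascii', 'replace')
```

On Python 2 you would then write print encode_for_terminal(content). On Python 3, print(content) already performs this encoding step for you.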
