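I just want to import a Chinese txt file and print its contents. The file contains Simplified Chinese text that I copied from the web: http://stock.hexun.com/2013-06-01/154742801.html
At first, I tried this: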
userinput = raw_input('Enter the name of a file')
f=open(userinput,'r')
print f.read()
f.close()
It opens the file and prints it, but the output is garbled. Then I tried the following version with an explicit encoding:
#coding=UTF-8
userinput = raw_input('Enter the name of a file')
import codecs
f= codecs.open(userinput,"r","UTF-8")
str1=f.read()
print str1
f.close()
However, it raised an error: UnicodeEncodeError: 'cp950' codec can't encode character u'\u76d8' in position 50: illegal multibyte sequence
Why does this error occur, and how can I fix it? I also tried other encodings such as Big5 and cp950, but it still doesn't work.
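For context, the traceback is an encode error rather than a decode error, which suggests the file was read correctly and the failure happens when print converts the text to the console encoding (cp950 here). A minimal sketch of that reading follows; `can_encode` is a hypothetical helper introduced only for illustration, and the `"replace"` workaround at the end is an assumption, not a confirmed fix for this setup.

```python
# -*- coding: utf-8 -*-
# Sketch only: can_encode is a hypothetical helper, not part of the question.

def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

ch = u"\u76d8"  # the character named in the error message

# UTF-8 can represent any Unicode character, so reading the file works.
print(can_encode(ch, "utf-8"))

# cp950 is a Traditional Chinese code page; the traceback reports that
# this Simplified Chinese character has no mapping in it.
print(can_encode(ch, "cp950"))

# Assumed workaround: substitute unencodable characters instead of failing.
safe_bytes = ch.encode("cp950", "replace")
print(repr(safe_bytes))
```

If this reading is right, changing the codec passed to codecs.open would not help, because the file decoding is not the step that fails.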
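I'm currently using pyteaser for summarization, and it works well. I was looking at the source code, but I don't understand the following code, even with the help of the comments below. Can anyone explain it?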
def split_sentences(text):
    '''
    The regular expression matches all sentence ending punctuation and splits the string at those points.
    At this point in the code, the list looks like this ["Hello, world", "!" ... ]. The punctuation and all quotation marks
    are separated from the actual text. The first s_iter line turns each group of two items in the list into a tuple,
    excluding the last item in the list (the last item in the list does not need to …
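The two steps the docstring describes (split on sentence-ending punctuation while keeping the punctuation as separate items, then pair each piece of text with the punctuation that follows it) can be sketched roughly as below. This is not pyteaser's actual code: the function name `split_sentences_sketch` and the simplified regular expression are assumptions made for illustration.

```python
import re

def split_sentences_sketch(text):
    """Illustrative only; the name and the regex are hypothetical."""
    # A capturing group makes re.split keep the punctuation as its own item,
    # e.g. ['Hello, world', '!', ' How are you', '?', ' Fine'].
    parts = re.split(r'([.!?])(?=\s)', text)

    # Pair up items two at a time, leaving out the last item:
    # ('Hello, world', '!'), (' How are you', '?').
    it = iter(parts[:-1])
    pairs = list(zip(it, it))

    # Glue each sentence body back onto its punctuation mark.
    sentences = [(body + punct).strip() for body, punct in pairs]

    # The last item has no trailing punctuation item, so add it as-is.
    tail = parts[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences_sketch("Hello, world! How are you? Fine"))
```

The last list item is handled separately because it is the one piece of text that is not followed by its own punctuation item.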