Python: 'ascii' codec can't encode en dash

Question

the*_*ird 4 printing utf-8 non-ascii-characters utf8-decode python-2.7

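I'm trying to print a poem from the Poetry Foundation's daily poem RSS feed on a thermal printer that supports the CP437 encoding. This means I need to translate some characters; in this case, an en dash to a hyphen. But Python won't even encode the en dash to begin with. When I try to decode the string and replace the en dash with a hyphen, I get the following error: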

Traceback (most recent call last):
  File "pftest.py", line 46, in <module>
    str = str.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 140: ordinal not in range(128)

Here is my code:

#!/usr/bin/python
#-*- coding: utf-8 -*-

# This string is actually a variable entitled d['entries'][1].summary_detail.value
str = """Love brought by night a vision to my bed,
One that still wore the vesture of a child
But eighteen years of age – who sweetly smiled"""

str = str.decode('utf-8')
str = str.replace("\u2013", "-") #en dash
str = str.replace("\u2014", "--") #em dash
print (str)

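I can actually print the output in a terminal window (Mac) without errors using the following code, but my printer spits out a set of 3 CP437 characters: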

str = u''.str.encode('utf-8')

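I'm using Sublime Text as my editor, and I've saved the file with UTF-8 encoding, but I'm not sure that helps. I would greatly appreciate any help with this code. Thank you!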

Answer 1

Jon*_*caC 9

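I don't fully understand what's happening in your code, but I've also been trying to replace en dashes with hyphens in a string I got from the web, and this is what worked for me. My code is just this: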

txt = re.sub(u"\u2013", "-", txt)

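I'm using Python 2.7 and Sublime Text 2, but I don't bother setting -*- coding: utf-8 -*- in my script, because I'm trying not to introduce any new encoding problems. (Even though my variables may contain Unicode, I like to keep my code pure ASCII.) Do you need to include Unicode in your .py file, or was that just to help with debugging?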

I'll note that my txt variable is already a unicode string, i.e.

print type(txt)

yields

<type 'unicode'>

I'd be curious to know what type(str) yields in your case.
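
The type matters because the traceback in the question is a UnicodeEncodeError raised from inside decode(). On Python 2 that usually means the value was already a unicode object: calling .decode('utf-8') on it makes Python first encode it with the default ascii codec, which fails on the en dash. A minimal sketch of a guard, where ensure_text is a hypothetical helper name and the sample strings are made up for illustration:

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string, decoding only byte strings."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2.7
        return value.decode(encoding)
    return value  # already text, so no decode() call is needed


raw = b"age \xe2\x80\x93 who"     # UTF-8 bytes containing an en dash
already_text = u"age \u2013 who"  # what a feed parser may already hand back

print(ensure_text(raw) == already_text)
print(ensure_text(already_text) == already_text)
```

The sketch is written to run on both Python 2.7 and Python 3; whether the feed value really is unicode in the asker's script is an assumption based on the traceback.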

One thing I noticed in your code is

str = str.replace("\u2013", "-") #en dash

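Are you sure that does anything? My understanding is that \u only means "unicode character" inside a u"" string, and what you're creating there is a string of 6 characters: a backslash, a "u", a "2", a "0", and so on. (The backslash stays because, in a non-unicode string literal, Python 2 leaves an escape sequence it doesn't recognize unchanged, unlike '\n' or '\t'.) Such a pattern never matches an actual en dash, so the replace() does nothing.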
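
A small sketch of the difference, written so it behaves the same on Python 2.7 and Python 3. The variable names and the sample text are illustrative only; the doubled backslash in literal spells out explicitly what a plain "\u2013" holds on Python 2:

```python
text = u"age \u2013 who"

# With the u prefix, \u2013 is a single character: the en dash itself.
en_dash = u"\u2013"
# Six separate characters: backslash, u, 2, 0, 1, 3.
literal = u"\\u2013"

print(len(en_dash))
print(len(literal))

fixed = text.replace(en_dash, u"-")      # the en dash is found and replaced
unchanged = text.replace(literal, u"-")  # nothing matches, text stays as it was

print(fixed == u"age - who")
print(unchanged == text)
```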

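Also, the fact that you're getting 3 CP437 characters from the printer makes me suspect that you still have an en dash in your string. The UTF-8 encoding of the en dash is 3 bytes: 0xe2 0x80 0x93. When you call str.encode('utf-8') on a unicode string that contains an en dash, you get those three bytes in the returned string. I'm guessing your terminal knows how to interpret them as an en dash, and that's what you're seeing.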
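
This can be checked without the printer. The sketch below encodes an en dash as UTF-8 and then decodes the same bytes with Python's built-in cp437 codec, which approximates what a CP437 device would make of them (that the printer interprets raw bytes this way is an assumption):

```python
en_dash = u"\u2013"

utf8_bytes = en_dash.encode("utf-8")   # the three bytes 0xe2 0x80 0x93
as_cp437 = utf8_bytes.decode("cp437")  # each byte read as one CP437 character

print(len(utf8_bytes))
print(len(as_cp437))
# Code points of the three characters a CP437 device would show
print([hex(ord(ch)) for ch in as_cp437])
```

One dash going in as three bytes and coming out as three unrelated characters matches the symptom described in the question.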

If you can't get my first approach to work, I'll mention that I've also had success with this:

txt = txt.encode('utf-8')
txt = re.sub("\xe2\x80\x93", "-", txt)

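Maybe that would work for you if you put the re.sub() after your call to encode(). In that case, you wouldn't need the call to decode() at all. I'll confess that I really don't understand why it's there.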
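
Since the question says the printer supports CP437, another option is to do the replacements on the unicode string and then encode straight to CP437 rather than UTF-8. This is only a sketch under assumptions: to_cp437 and REPLACEMENTS are hypothetical names, the mapping covers just the two dashes from the question, and it assumes the printer accepts raw CP437 bytes.

```python
# Characters CP437 lacks, mapped to ASCII stand-ins (illustrative, not exhaustive).
REPLACEMENTS = {
    u"\u2013": u"-",   # en dash
    u"\u2014": u"--",  # em dash
}


def to_cp437(text):
    """Swap unsupported punctuation in a unicode string, then encode it as CP437 bytes."""
    for char, substitute in REPLACEMENTS.items():
        text = text.replace(char, substitute)
    # errors="replace" turns anything still unmappable into "?" instead of raising.
    return text.encode("cp437", "replace")


line = u"But eighteen years of age \u2013 who sweetly smiled"
printer_bytes = to_cp437(line)
print(printer_bytes == b"But eighteen years of age - who sweetly smiled")
```

The input has to be a unicode string already; a byte string would need decoding first. Characters that CP437 does contain, such as accented letters, pass through unchanged.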
