Nem*_*den 10 python transliteration
I get a UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128) exception when trying to use string.maketrans in Python. I'm rather discouraged by this kind of error in the following code (gist):
# -*- coding: utf-8 -*-
import string


def translit1(string):
    """ This function works just fine """
    capital_letters = {
        u'А': u'A',
        u'Б': u'B',
        u'В': u'V',
        u'Г': u'G',
        u'Д': u'D',
        u'Е': u'E',
        u'Ё': u'E',
        u'Ж': u'Zh',
        u'З': u'Z',
        u'И': u'I',
        u'Й': u'Y',
        u'К': u'K',
        u'Л': u'L',
        u'М': u'M',
        u'Н': u'N',
        u'О': u'O',
        u'П': u'P',
        u'Р': u'R',
        u'С': u'S',
        u'Т': u'T',
        u'У': u'U',
        u'Ф': u'F',
        u'Х': u'H',
        u'Ц': u'Ts',
        u'Ч': u'Ch',
        u'Ш': u'Sh',
        u'Щ': u'Sch',
        u'Ъ': u'',
        u'Ы': u'Y',
        u'Ь': u'',
        u'Э': u'E',
        u'Ю': u'Yu',
        u'Я': u'Ya'
    }
    lower_case_letters = {
        u'а': u'a',
        u'б': u'b',
        u'в': u'v',
        u'г': u'g',
        u'д': u'd',
        u'е': u'e',
        u'ё': u'e',
        u'ж': u'zh',
        u'з': u'z',
        u'и': u'i',
        u'й': u'y',
        u'к': u'k',
        u'л': u'l',
        u'м': u'm',
        u'н': u'n',
        u'о': u'o',
        u'п': u'p',
        u'р': u'r',
        u'с': u's',
        u'т': u't',
        u'у': u'u',
        u'ф': u'f',
        u'х': u'h',
        u'ц': u'ts',
        u'ч': u'ch',
        u'ш': u'sh',
        u'щ': u'sch',
        u'ъ': u'',
        u'ы': u'y',
        u'ь': u'',
        u'э': u'e',
        u'ю': u'yu',
        u'я': u'ya'
    }
    translit_string = ""
    for index, char in enumerate(string):
        if char in lower_case_letters.keys():
            char = lower_case_letters[char]
        elif char in capital_letters.keys():
            char = capital_letters[char]
            if len(string) > index+1:
                if string[index+1] not in lower_case_letters.keys():
                    char = char.upper()
            else:
                char = char.upper()
        translit_string += char
    return translit_string


def translit2(text):
    """ This method should be more easy to grasp,
    but throws exception:
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)
    """
    symbols = string.maketrans(u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
                               u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
    sequence = {
        u'ж':'zh',
        u'ц':'ts',
        u'ч':'ch',
        u'ш':'sh',
        u'щ':'sch',
        u'ю':'ju',
        u'я':'ja',
        u'Ж':'Zh',
        u'Ц':'Ts',
        u'Ч':'Ch'
    }
    for char in sequence.keys():
        text = text.replace(char, sequence[char])
    return text.translate(symbols)


if __name__ == "__main__":
    print translit1(u"Привет") # prints Privet as expected
    print translit2(u"Привет") # throws exception: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)
The original traceback:
Traceback (most recent call last):
  File "translit_error.py", line 124, in <module>
    print translit2(u"Привет") # throws exception: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)
  File "translit_error.py", line 103, in translit2
    u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)
I mean, why is Python's string.maketrans trying to use the ASCII table at all? And how can English letters be outside the 0-128 range?
$ python -c "print ord(u'A')"
65
$ python -c "print ord(u'z')"
122
$ python -c "print ord(u\"'\")"
39
After several hours on this I feel completely out of ideas.
Can someone explain what is happening and how to fix it?
geo*_*org 20
translate behaves differently when used with unicode strings. Instead of a maketrans table, you have to provide a dictionary mapping ord(search) -> ord(replace):
symbols = (u"абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ",
           u"abvgdeejzijklmnoprstufhzcss_y_euaABVGDEEJZIJKLMNOPRSTUFHZCSS_Y_EUA")
tr = {ord(a):ord(b) for a, b in zip(*symbols)}
# for Python older than 2.7 (no dict comprehensions):
# tr = dict( [ (ord(a), ord(b)) for (a, b) in zip(*symbols) ] )
text = u'Добрый Ден'
print text.translate(tr) # looks good
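The same dictionary idea can be sketched for Python 3, where str.translate also accepts whole strings as replacement values, so multi-letter replacements such as "zh" or "sch" need no separate replace() pass. This is only an illustration under assumptions: the names CYRILLIC, LATIN, DIGRAPHS, TABLE and translit are made up here, and the mapping reuses the letters from the question's translit2 (with the missing capital digraphs filled in) rather than any official transliteration scheme.

```python
# Python 3 sketch; the mapping below is illustrative, not an official scheme.
CYRILLIC = "абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ"
LATIN = "abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E"

# Letters that expand to more than one Latin character.
DIGRAPHS = {
    "ж": "zh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "sch", "ю": "ju", "я": "ja",
    "Ж": "Zh", "Ц": "Ts", "Ч": "Ch", "Ш": "Sh", "Щ": "Sch", "Ю": "Ju", "Я": "Ja",
}

# str.translate expects code points as keys; values may be whole strings.
TABLE = {ord(cyr): lat for cyr, lat in zip(CYRILLIC, LATIN)}
TABLE.update({ord(cyr): lat for cyr, lat in DIGRAPHS.items()})


def translit(text):
    """Transliterate Cyrillic letters; all other characters pass through."""
    return text.translate(TABLE)


print(translit("Привет"))
```

Characters that are not in the table (Latin letters, digits, punctuation) are left untouched, because translate only rewrites code points it finds as keys.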
That said, I'd suggest not reinventing the wheel and using an established library instead: http://pypi.python.org/pypi/Unidecode
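As for why the ascii codec shows up at all: as far as I understand, Python 2's string.maketrans only works on byte strings, so a unicode argument is first implicitly encoded with the default ascii codec. The 52 Cyrillic characters of the first argument cannot be encoded that way, which would explain "position 0-51"; the Latin letters in the second argument are not the problem. The sketch below reproduces just that encoding step in Python 3; the helper name ascii_error is hypothetical.

```python
# Python 3 sketch of the implicit ASCII encoding step described above.
cyrillic = "абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ"  # 52 characters
latin = "abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E"


def ascii_error(text):
    """Return the UnicodeEncodeError raised by an ASCII encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


err = ascii_error(cyrillic)
print(err)                 # the failing range covers all 52 Cyrillic characters
print(ascii_error(latin))  # None: the Latin replacements encode without error
```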
Art*_*yan 19
You can use the transliterate package (https://pypi.python.org/pypi/transliterate)
Example #1:
from transliterate import translit
print translit("Lorem ipsum dolor sit amet", "ru")
# Лорем ипсум долор сит амет
Example #2:
print translit(u"Лорем ипсум долор сит амет", "ru", reversed=True)
# Lorem ipsum dolor sit amet
小智 10
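Check out the CyrTranslit package, which is made specifically for transliterating Cyrillic-script text. It currently supports Serbian, Montenegrin, Macedonian, and Russian.
Usage example: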
>>> import cyrtranslit
>>> cyrtranslit.supported()
['me', 'sr', 'mk', 'ru']
>>> cyrtranslit.to_latin('Моё судно на воздушной подушке полно угрей', 'ru')
'Moyo sudno na vozdushnoj podushke polno ugrej'
>>> cyrtranslit.to_cyrillic('Moyo sudno na vozdushnoj podushke polno ugrej', 'ru')
'Моё судно на воздушной подушке полно угрей'