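I have a long list of domain names that I need to generate some reports on. The list contains some IDN domains, and I know how to convert them in Python on the command line: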
>>> domain = u"pfarmerü.com"
>>> domain
u'pfarmer\xfc.com'
>>> domain.encode("idna")
'xn--pfarmer-t2a.com'
>>>
However, I'm struggling to do the same thing in a small script that reads the data from a text file.
#!/usr/bin/python
import sys
infile = open(sys.argv[1])
for line in infile:
    print line,
    domain = unicode(line.strip())
    print type(domain)
    print "IDN:", domain.encode("idna")
    print
I get the following output:
$ ./idn.py ./test
pfarmer.com
<type 'unicode'>
IDN: pfarmer.com
pfarmerü.com
Traceback (most recent call last):
  File "./idn.py", line 9, in <module>
    domain = unicode(line.strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)
I have also tried:
#!/usr/bin/python
import sys
import codecs
infile = codecs.open(sys.argv[1], "r", "utf8")
for line in infile:
    print line,
    domain = line.strip()
    print type(domain)
    print "IDN:", domain.encode("idna")
    print
Which gives me:
$ ./idn.py ./test
Traceback (most recent call last):
  File "./idn.py", line 8, in <module>
    for line in infile:
  File "/usr/lib/python2.6/codecs.py", line 679, in next
    return self.reader.next()
  File "/usr/lib/python2.6/codecs.py", line 610, in next
    line = self.readline()
  File "/usr/lib/python2.6/codecs.py", line 525, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/lib/python2.6/codecs.py", line 472, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range
Here is my test data file:
pfarmer.com
pfarmerü.com
It's quite clear to me that I need to learn about unicode now.
Thanks,
Peter
kni*_*tti 14
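You need to know which encoding your file was saved in. That will be something like 'utf-8' (which is an encoding of Unicode, not Unicode itself), 'iso-8859-1', 'cp1252', or similar.
Then you can do (assuming 'utf-8'):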
import sys

infile = open(sys.argv[1])
for line in infile:
    print line,
    # decode the raw bytes using the encoding the file was saved in
    domain = line.strip().decode('utf-8')
    print type(domain)
    print "IDN:", domain.encode("idna")
    print
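If you are not sure how the file was saved, the traceback in the question gives a hint: it reports byte 0xfc, which is how ü is stored in ISO-8859-1 and cp1252 (in UTF-8 it would be the two bytes 0xc3 0xbc). Here is a minimal sketch of a fallback, assuming the file is either UTF-8 or cp1252. The helper name `decode_line` and the candidate list are my own choices for illustration, and trying UTF-8 first is only a heuristic, not a reliable detector. It is written so that it should run on both Python 2 and Python 3.

```python
def decode_line(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate encoding that decodes raw.

    The candidate list is an assumption; adjust it to the encodings
    your files can realistically be in.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode %r" % (encodings, raw))


# ü stored as a single byte (ISO-8859-1 / cp1252 style)
text, used = decode_line(b"pfarmer\xfc.com")
print(used)

# ü stored as two bytes (UTF-8 style)
text, used = decode_line(b"pfarmer\xc3\xbc.com")
print(used)
```

UTF-8 goes first because bytes that are not UTF-8 usually fail to decode as UTF-8, whereas cp1252 accepts almost any byte sequence and would silently produce the wrong characters.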
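You convert an encoded string to unicode with decode. You convert unicode to an encoded string with encode. If you try to encode something that is already encoded, Python first tries to decode it with the default codec 'ascii', which fails for non-ASCII values.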
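To make that hidden step visible, here is a small sketch that performs the ASCII decode explicitly and then does the conversion in the right order. It assumes the bytes come from an ISO-8859-1 file, which is what the 0xfc in the question's traceback suggests; the variable names are just for illustration. It should run on both Python 2 and Python 3.

```python
# Raw bytes as they would be read from the file (assumed ISO-8859-1).
raw = b"pfarmer\xfc.com"

# The decode that Python 2 attempts implicitly when .encode() is
# called on a byte string; done explicitly here so the error is visible.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# The right order: decode with the file's real encoding first,
# then encode the resulting unicode text as IDNA.
text = raw.decode("iso-8859-1")
idn = text.encode("idna")

print(ascii_error)
print(idn)
```

The first line printed is the same "'ascii' codec can't decode byte 0xfc" complaint as in the question, and the second is the xn--pfarmer-t2a.com form from the interactive session.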