Joe*_*oel 3 python google-app-engine
I'm using urlfetch to fetch a URL. When I try to pass the result to the html2text function (which strips all HTML tags), I get the following message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position ... character maps to <undefined>
I've been trying to deal with it by calling encode('UTF-8', 'ignore') on the string, but I keep getting this error.
Any ideas?
Thanks,
Joel
Some code:
result = urlfetch.fetch(url="http://www.google.com")
html2text(result.content.encode('utf-8', 'ignore'))
And the error message:
File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>
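You need to decode the data you fetched first! With which codec? That depends on the site you fetched.
Once you have unicode and encode it with some_unicode.encode('utf-8', 'ignore'), I can't imagine how that could raise an error.
OK, here is what you need to do: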
from google.appengine.api.urlfetch import fetch

result = fetch('http://google.com')
content_type = result.headers['Content-Type']  # figure out what you just fetched
ctype, charset = content_type.split(';')  # breaks if the header has no charset parameter
encoding = charset[len(' charset='):]  # get the encoding name
print encoding  # e.g. ISO-8859-1
utext = result.content.decode(encoding)  # now you have unicode
text = utext.encode('utf8', 'ignore')  # encode to UTF-8
This isn't very robust, but it should show you the way.
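A slightly more defensive version of the same idea might look like the sketch below. The helper names `charset_from_content_type` and `decode_body` are made up for illustration (they are not part of urlfetch or html2text), and falling back to UTF-8 when the header names no usable charset is an assumption, not something the server guarantees.

```python
import codecs


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or `default`.

    Hypothetical helper: tolerates a missing header, a missing charset
    parameter, extra parameters, and quoted values.
    """
    for param in (content_type or "").split(";")[1:]:
        name, _, value = param.partition("=")
        value = value.strip().strip("\"'")
        if name.strip().lower() == "charset" and value:
            return value
    return default


def decode_body(content, content_type, default="utf-8"):
    """Decode fetched bytes to unicode text.

    Falls back to `default` (an assumed fallback) if Python does not know
    the declared charset, and replaces undecodable bytes instead of raising.
    """
    encoding = charset_from_content_type(content_type, default)
    try:
        codecs.lookup(encoding)
    except LookupError:
        encoding = default
    return content.decode(encoding, "replace")
```

With urlfetch, this would probably be called as `decode_body(result.content, result.headers.get('Content-Type'))`, and the resulting unicode string handed to html2text, assuming html2text accepts unicode input.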