I have a problem with the html2text module: it raises a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte
0xbe in position 6: ordinal not in range(128)
Example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib
h = html2text.HTML2Text()
h.ignore_links = True
html = urllib.urlopen( "http://google.com" ).read()
print h.handle( html )
I also tried h.handle(unicode(html, "utf-8")), without success. Any help would be appreciated.
Edit: the full traceback:
Traceback (most recent call last):
File "test.py", line 12, in <module>
print h.handle(html)
File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
return self.optwrap(self.close())
File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)
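The problem is easy to reproduce when the source is not decoded, but everything works fine when the source is decoded correctly. You also get the error if you reuse the parser!
You can try this out with a known-good Unicode source, such as http://www.ltg.ed.ac.uk/~richard/unicode-sample.html.
If you do not decode the response to unicode, the library fails: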
>>> h = html2text.HTML2Text()
>>> h.handle(html)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Now, if you reuse the HTML2Text object, its state is not cleared and it still holds the incorrect data, so it fails even when you pass in Unicode:
>>> h.handle(html.decode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
You need to use a new object, and then it works just fine:
>>> h = html2text.HTML2Text()
>>> result = h.handle(html.decode('utf8'))
>>> len(result)
12750
>>> type(result)
<type 'unicode'>
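The two points above (decode first, and do not reuse a parser that has already failed) can be combined in a small helper. This is only a minimal sketch: decode_html and html_to_text are hypothetical names, not part of html2text, and it assumes the caller can supply the value of the Content-Type header and that UTF-8 is an acceptable fallback when no charset is declared.

```python
import re


def decode_html(raw, content_type=None, fallback="utf-8"):
    """Return text for raw HTML, decoding byte strings first.

    The charset is taken from the Content-Type header value when one is
    given; otherwise the fallback (an assumption, UTF-8 here) is used.
    An unknown charset name raises LookupError.
    """
    if not isinstance(raw, bytes):
        return raw  # already text, nothing to decode
    match = re.search(r'charset="?([\w-]+)', content_type or "", re.I)
    encoding = match.group(1) if match else fallback
    # "replace" substitutes undecodable bytes instead of raising.
    return raw.decode(encoding, "replace")


def html_to_text(raw, parser_factory, content_type=None):
    """Convert one document using a parser created just for this call."""
    parser = parser_factory()  # fresh object, so no state is left over
    return parser.handle(decode_html(raw, content_type))
```

With the library from the question this would be called as html_to_text(html, html2text.HTML2Text). Passing the class rather than an instance is the design choice that guarantees a new parser for every document.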