Tas*_*sos 7 python encoding beautifulsoup
I am trying to scrape a page, but I get a UnicodeDecodeError. Here is my code:
import urllib2
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4; the original import was not shown

def soup_def(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    usock = urllib2.urlopen(req)
    # charset as declared in the Content-Type response header
    encoding = usock.headers.getparam('charset')
    page = usock.read().decode(encoding)
    usock.close()
    soup = BeautifulSoup(page)
    return soup

soup = soup_def("http://www.geekbuying.com/item/Ainol-Novo-10-Hero-II-Quad-Core--Tablet-PC-10-1-inch-IPS-1280-800-1GB-RAM-16GB-ROM-Android-4-1--HDMI-313618.html")
And the error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 284: invalid start byte
I checked and found several users with the same error, but I could not find any solution.
Here is what I found on Wikipedia about the byte 0xff: it is part of the byte order mark (BOM) used by UTF-16.

UTF-16
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1.
If the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1.
Programs expecting UTF-8 may show these or error indicators, depending on how they handle UTF-8 encoding errors. In all cases they will probably display the rest of the file as garbage (a UTF-16 text containing ASCII only will be fairly readable).

So I have two ideas here:
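(1) The page may need to be treated as utf-16 rather than utf-8.
(2) The error occurs because you are trying to print the entire soup to the screen. In that case it comes down to whether your IDE (Eclipse/PyCharm) is smart enough to display those Unicode characters.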
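
Idea (1) can be checked on the raw bytes before decoding. The following is a minimal sketch using only the standard `codecs` module; the helper names `sniff_bom` and `decode_page` are hypothetical and not part of the question's code. It looks for a BOM at the start of the data and otherwise falls back to the declared charset. Note that the error in the question is reported at position 284, not at the start, so a leading BOM may not be the cause; `errors="replace"` is used here so a stray byte does not abort the whole decode.

```python
import codecs

# (BOM bytes, codec name) pairs. UTF-32 comes first because its little-endian
# BOM starts with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


def decode_page(raw, declared="utf-8"):
    """Decode raw bytes, preferring a BOM over the declared charset."""
    encoding = sniff_bom(raw) or declared or "utf-8"
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    return raw.decode(encoding, errors="replace")
```

In the question's function this would replace the `usock.read().decode(encoding)` step, for example `page = decode_page(usock.read(), encoding)`.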
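If I were you, I would try to move on without printing the entire soup and collect only the parts you need. See whether you run into a problem when you reach that step. If there is no problem there, does it matter that the whole soup cannot be printed to the screen?
If you really do want to print the soup to the screen, try: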
print soup.prettify(encoding='utf-16')
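
If the failure really is on the display side, as idea (2) suggests, another option is to encode the text for the console yourself and replace whatever it cannot represent. This is only a sketch under that assumption; `to_console_bytes` is a hypothetical helper, and the fallback to UTF-8 when the console reports no encoding is my own choice.

```python
import sys


def to_console_bytes(text, encoding=None):
    """Encode text for display, replacing characters the target cannot represent."""
    # Use the given encoding, else what the console reports, else UTF-8 (assumption)
    enc = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(enc, errors="replace")
```

With an ASCII-only console, a character such as é comes out as `?` instead of raising an error.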