Python - BeautifulSoup HTML parsing handles GBK encoding poorly - Chinese web-scraping problem

Whe*_*ton 2 python unicode beautifulsoup unicode-string web-scraping

I have been tinkering with the following script:

#    -*- coding: utf8 -*-
import codecs
from BeautifulSoup import BeautifulSoup, NavigableString, UnicodeDammit
import urllib2,sys
import time
try:
    import timeoutsocket # http://www.timo-tasi.org/python/timeoutsocket.py
    timeoutsocket.setDefaultSocketTimeout(10)
except ImportError:
    pass

h=u'\u3000\u3000\u4fe1\u606f\u901a\u4fe1\u6280\u672f'

address=urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
soup=BeautifulSoup(address)

p=soup.findAll('p')
t=p[2].string[:10]

which produces the following output:

>>> print t
¡¡¡¡ÐÅÏ¢Í¨
>>> print h
　　信息通
>>> t
u'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'
>>> h
u'\u3000\u3000\u4fe1\u606f\u901a'
>>> h.encode('gbk')
'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'

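Put simply: when I pass this HTML through BeautifulSoup, it takes the GBK-encoded text and treats it as Unicode, instead of recognizing that it needs to be decoded first. Yet "h" and "t" should be identical, because h is just the text I took from the HTML file and converted by hand.

How do I fix this?

Best,

Wheaton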

Sim*_*onJ 5

The file's meta tag claims that the character set is GB2312, but the data contains a character from the newer GBK/GB18030, and that is what trips BeautifulSoup up:

simon@lucifer:~$ python
Python 2.7 (r27:82508, Jul  3 2010, 21:12:11) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> data = urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
>>> data.decode("gb2312")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 20148-20149: illegal multibyte sequence

At this point UnicodeDammit bails out and tries chardet, then UTF-8, and finally Windows-1252, which always succeeds. By the looks of it, that is what you got.
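The effect of that last fallback can be reproduced without the network. This is a minimal sketch in Python 3 (an assumption; the session above is Python 2) that takes the five characters from the question, encodes them as GBK, and then decodes the bytes with the wrong codec, as the fallback does. The variable names are only illustrative.

```python
# Python 3 sketch: how GBK bytes turn into the garbage seen in the question.
h = "\u3000\u3000\u4fe1\u606f\u901a"   # the text the page really contains

raw = h.encode("gbk")                  # the bytes the server sends
# raw == b'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'

# Windows-1252 maps each of these bytes to some character, so the decode
# "succeeds", but every Chinese character becomes two Latin ones.
mojibake = raw.decode("cp1252")        # '¡¡¡¡ÐÅÏ¢Í¨'

# The damage is reversible as long as no byte was dropped along the way:
repaired = mojibake.encode("cp1252").decode("gbk")
print(repaired == h)                   # True
```

The round trip works here because all ten bytes have a Windows-1252 mapping. It is a way to see what happened, not a substitute for decoding with the right codec in the first place.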

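If we tell the decoder to substitute a replacement marker for bytes it cannot decode (Python uses U+FFFD, which many terminals show as '?'), we can see the character that is missing from GB2312: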

>>> print data[20140:20160].decode("gb2312", "replace")
??????????

Using the correct encoding:

>>> print data[20140:20160].decode("gb18030", "replace")
??????????
>>> from BeautifulSoup import BeautifulSoup
>>> s = BeautifulSoup(data, fromEncoding="gb18030")
>>> print s.findAll("p")[2].string[:10]
?????????&
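The difference between the two codecs can also be checked offline. The sketch below is Python 3 (an assumption) and uses a made-up sample string, not text from the page: the character 镕 exists in GBK and GB18030 but not in GB2312, so it stands in for the offending character in the article.

```python
# Python 3 sketch: strict gb2312 rejects a GBK-only character, gb18030 accepts it.
sample = "朱镕基"                  # hypothetical sample; 镕 is not in GB2312
raw = sample.encode("gbk")         # what a "GB2312" page may really contain

try:
    raw.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError as exc:
    gb2312_ok = False
    print("gb2312 failed:", exc.reason)

# GB18030 is a superset of GBK, which in turn is a superset of GB2312,
# so it decodes everything a GB2312 or GBK page can legally contain.
decoded = raw.decode("gb18030")
print(decoded == sample)           # True
```

Because of that superset relationship, asking for gb18030 on a page that declares GB2312 loses nothing when the page is well-formed and rescues it when it is not.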

And the full paragraph:

>>> print s.findAll("p")[2].string
?????????“???”?????????????????????
???????GDP???????????????????????????????
????????????????????????????????????????
????
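If you scrape many pages like this one, you may prefer to decode the bytes yourself before handing them to the parser. A rough sketch, in Python 3 (an assumption), of one way to do it: widen the declared charset to its superset and decode with that. `SUPERSETS` and `decode_declared` are hypothetical names, not part of BeautifulSoup, and the sample page is made up.

```python
# Python 3 sketch: decode with the superset of whatever charset the page declares.
SUPERSETS = {"gb2312": "gb18030", "gbk": "gb18030"}  # hypothetical lookup table


def decode_declared(raw, declared):
    """Decode raw bytes, replacing the declared charset by its superset if known."""
    codec = SUPERSETS.get(declared.strip().lower(), declared)
    return raw.decode(codec)


# Stand-in for a page whose meta tag says GB2312 but which holds a GBK-only character.
page = "<p>朱镕基</p>".encode("gbk")
text = decode_declared(page, "GB2312")
print(text == "<p>朱镕基</p>")     # True
```

The resulting string is already Unicode, so the parser has nothing left to guess. Charsets that are not in the table are passed through unchanged.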