urlopen, BeautifulSoup and UTF-8 problem

Question

Rya*_*rio 2 python urllib2 beautifulsoup utf-8

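I am simply trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. This character is not visible when I use "View Source".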

import urllib2
from BeautifulSoup import BeautifulSoup

isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()
html = BeautifulSoup(page)
html  # This line causes the error.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)

I also tried...

html = BeautifulSoup(page.encode('utf-8'))

How can I read this web page into BeautifulSoup without getting this error?

Answer 1

Tri*_*ych 11

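This error is probably happening when you try to print the representation of the BeautifulSoup object, which happens automatically if, as I suspect, you are working in the interactive console.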

# This code will work fine, note we are assigning the result 
# of the BeautifulSoup object to prevent it from printing immediately.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(u'\xa0')

# This will probably show the error you saw
print soup

# And this would probably be fine
print soup.encode('utf-8')
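
To illustrate why the explicit encode helps, here is a minimal sketch of the same failure without BeautifulSoup or a network request. It is written for Python 3, while the code in this thread is Python 2, so it performs the ASCII encoding step explicitly rather than through `print`. The sample string and the helper name `try_encode` are hypothetical, chosen only for illustration.

```python
# A string containing U+00A0 (non-breaking space), the character from the traceback.
# The surrounding text is made up for this example.
text = u"price:\xa0100"


def try_encode(s, encoding):
    """Return the encoded bytes, or a short message if the codec cannot encode s."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return "failed: %s" % exc.reason


# ASCII has no representation for U+00A0, so this reproduces the error.
ascii_result = try_encode(text, "ascii")

# UTF-8 can represent every Unicode code point, so this succeeds.
utf8_result = try_encode(text, "utf-8")

print(ascii_result)
print(utf8_result)
```

The ASCII attempt fails for the same reason as the traceback in the question, while the UTF-8 attempt yields bytes that can be written to a terminal or file.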
