BeautifulSoup findall with class attribute- unicode encode error

Question

I'm using BeautifulSoup to extract news stories (just the titles) from Hacker News, and this is what I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

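def get_page():
    # Returns the open HTTP response (a file-like object), not a string.
    page_html = urllib2.urlopen(HN_url)
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html = []

    # Collect every <a> tag found inside each <td class="title"> cell.
    for td in soup.findAll("td", { "class":"title" }):
        titles_html += td.findAll("a")

    return titles_html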

print get_stories(get_page())

However, when I run the code, it fails with this error:

Traceback (most recent call last):
  File "terminalHN.py", line 19, in <module>
    print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)

How do I get it to work?

Answer 1

小智 6

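This happens because BeautifulSoup uses unicode strings internally. Printing a unicode string to the console makes Python try to convert it to Python's default encoding, which is usually ascii. For non-ascii websites this will usually fail. You can learn the basics of Python and Unicode by googling "python + unicode". In the meantime, convert your unicode strings to UTF-8 using: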

print some_unicode_string.encode('utf-8')

You want `.encode('utf-8')` to convert from a Unicode string to a UTF-8 encoded string. (3 upvotes)
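
To make the two directions concrete, here is a minimal sketch. The `title` string is a made-up example, not real Hacker News output; it only needs to contain u'\xe2', the character named in the traceback. The snippet is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; real titles would come from get_stories().
title = u"Caf\xe2 culture"

# Encoding to ASCII fails, just like the implicit conversion done by print.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# encode(): Unicode string -> UTF-8 byte string. This is the direction
# needed before writing to an ASCII-only console.
utf8_bytes = title.encode("utf-8")

# decode(): byte string -> Unicode string, i.e. the opposite direction.
round_trip = utf8_bytes.decode("utf-8")
```

Applied to the question, this suggests encoding each title's text before printing it, rather than printing the whole list of tags at once.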

Archived:	14 years, 9 months ago
Views:	18067
Last activity:	14 years, 9 months ago