Using soup.get_text() with UTF-8

Question

And*_*rey 5 python beautifulsoup python-2.7

I need to get all the text from a page using BeautifulSoup. The BeautifulSoup documentation says that soup.get_text() does this. When I tried it on reddit.com, I got this error:


UnicodeEncodeError in soup.py:16
  'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence

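I get similar errors on most of the sites I have checked.
I also got a similar error with soup.prettify(), but fixed it by changing the call to soup.prettify('UTF-8'). Is there a way to fix this for soup.get_text() as well? Thanks in advance!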

Update, June 24
I found some code that seems to work for other people, but I still need it to use UTF-8 instead of the default encoding. The code:


import re

texts = soup.findAll(text=True)

def visible(element):
    # Skip text inside elements that are not rendered on the page.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments.
    elif re.match('<!--.*-->', str(element)):
        return False
    # Skip text that starts with a newline.
    elif re.match('\n', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

print visible_texts

But now the error is different. Is that progress?


UnicodeEncodeError in soup.py:29
  'ascii' codec can't encode character u'\xbb' in position 1: ordinal not in range(128)

Answer 1

C0d*_*ker 0

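If the page may contain Unicode characters, you cannot call str(text) on it: in Python 2, str() implicitly encodes a unicode object with the ascii codec, which fails on any non-ASCII character. Use unicode() instead of str().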
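As a rough illustration of that advice, the sketch below keeps the text as Unicode throughout and encodes to UTF-8 only once, at the point of output. It uses the standard library only; the `texts` list is a hypothetical stand-in for what soup.findAll(text=True) would return, and the helper names `can_encode` and `is_visible` are made up for this example.

```python
import re

# Hypothetical sample values standing in for the Unicode strings that
# soup.findAll(text=True) would return; not taken from a real page.
texts = [u"\xbb next page", u"\n", u"caf\xe9\xa0menu"]


def can_encode(text, codec):
    """Return True if every character of `text` is representable in `codec`."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def is_visible(text):
    """Keep strings that do not start with a newline, without calling str()."""
    # re.match works on Unicode strings directly, so no conversion is needed.
    return re.match(u"\n", text) is None


visible_texts = [t for t in texts if is_visible(t)]

# Encode explicitly, once, at the point of output. UTF-8 covers the
# characters that the ascii and cp932 codecs reject in the errors above.
utf8_bytes = u"\n".join(visible_texts).encode("utf-8")
```

Here `can_encode(texts[0], "ascii")` and `can_encode(texts[2], "cp932")` are both false, which mirrors the two tracebacks in the question: the first comes from the console codec when printing, the second from the implicit ascii encode inside str(). Writing `utf8_bytes` to a file or byte stream avoids both.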
