BeautifulSoup changes the HTML

Irm*_*nis 3 python beautifulsoup python-requests

I noticed that when I fetch HTML from the web and parse it with Beautiful Soup, the text gets altered somehow. This is the code I use to fetch it:

from bs4 import BeautifulSoup
import requests
url = "http://www.basketnews.lt/lygos/59-nacionaline-krepsinio-asociacija/2013/naujienos.html"
r = requests.get(url)
soup = BeautifulSoup(r.text)
print(soup)

Here is part of the original HTML:

<a href="/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html">Valančiūnui ir „Raptors“ sezonas baigtas <span class="title_description">(foto, statistika)</span></a>

Here is the same part of the HTML after it has gone through Beautiful Soup:

<a href="/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html">ValanÄiÅ«nui ir âRaptorsâ sezonas baigtas <span class="title_description">(foto, statistika)</span></a>

As you can see, the text in the HTML I am parsing comes out garbled. What is the problem?

Mar*_*ers 8

You are using r.text, which means requests decodes the body using the encoding it picked by default; that encoding is wrong here, however:

>>> r = requests.get(url)
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'

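ISO-8859-1 (Latin 1) is the HTTP 1.1 default encoding for text/* responses that do not declare a charset, and this server sends a bare 'text/html' Content-Type.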
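To make the fallback rule concrete, here is a minimal sketch of how a client could derive an encoding from the Content-Type header alone. The function name charset_from_content_type is made up for illustration and this is not the code requests itself runs; it only mirrors the rule described above (explicit charset wins, otherwise text/* falls back to ISO-8859-1).

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Guess an encoding from a Content-Type header value alone.

    Hypothetical helper: an explicit charset parameter wins, a bare
    text/* type falls back to ISO-8859-1, anything else yields None.
    """
    msg = Message()
    msg["Content-Type"] = content_type

    # get_param returns None when the header has no charset parameter.
    charset = msg.get_param("charset")
    if charset is not None:
        return charset.strip("'\"")

    # HTTP 1.1 default for text/* without a declared charset.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"

    return None


print(charset_from_content_type("text/html"))
print(charset_from_content_type("text/html; charset=utf-8"))
```

With the header from this question ('text/html', no charset), the sketch lands on ISO-8859-1, which is the same value r.encoding reports above.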

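When the detection algorithm is run on the body itself (this is what r.apparent_encoding reports), UTF-8 is found instead.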
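The effect of decoding UTF-8 bytes as ISO-8859-1 can be reproduced without any network access. The following is a small Python 3 sketch; the sample string is the headline from the question, and the variable names are only illustrative.

```python
# The headline as it is really written.
original = "Valančiūnui ir „Raptors“ sezonas baigtas"

# The server sends the text as UTF-8 bytes.
raw = original.encode("utf-8")

# Decoding those bytes as ISO-8859-1 never fails, because every byte
# maps to some Latin 1 character; the result is just the wrong text.
wrong = raw.decode("iso-8859-1")

# Decoding with the real encoding gives the headline back.
right = raw.decode("utf-8")

# repr() makes the invisible control characters in the garbled text visible.
print(repr(wrong))
print(right == original)

# The damage is reversible as long as nothing was dropped along the way.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

The garbled string begins with "ValanÄ", which matches the broken output shown in the question: each multi-byte UTF-8 character has been split into two or three Latin 1 characters.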

You should not use r.text; use r.content (the raw bytes) instead and leave the encoding detection to BeautifulSoup:

soup = BeautifulSoup(r.content)

Now it works correctly:

>>> soup = BeautifulSoup(r.content)
>>> soup.find('a', href='/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html')
<a href="/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html">Valančiūnui ir „Raptors“ sezonas baigtas <span class="title_description">(foto, statistika)</span></a>
>>> print(soup.find('a', href='/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html').text)
Valančiūnui ir „Raptors“ sezonas baigtas (foto, statistika)

BeautifulSoup uses auto-detection, but in this case it finds the <meta> header in the page, which declares the correct encoding:

>>> soup.find('meta', {'http-equiv': 'content-type'})
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
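To illustrate why handing over raw bytes helps, here is a deliberately simplified Python 3 sketch of reading the charset out of a <meta> tag before decoding. This is not how BeautifulSoup is implemented; the names META_CHARSET and sniff_meta_charset and the sample page are assumptions made up for this example.

```python
import re

# Matches both <meta charset="..."> and the http-equiv form, where the
# charset sits inside the content attribute. Works on bytes, because the
# encoding is not known yet at this point.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE
)


def sniff_meta_charset(html_bytes, default="utf-8"):
    """Return the charset declared in a <meta> tag, or the default."""
    # Such declarations normally appear near the top of the document.
    match = META_CHARSET.search(html_bytes[:1024])
    if match:
        return match.group(1).decode("ascii")
    return default


# Hypothetical page bytes, standing in for r.content.
page = (
    b'<html><head>'
    b'<meta content="text/html; charset=utf-8" http-equiv="content-type"/>'
    b'</head><body>'
    + "Valančiūnui".encode("utf-8")
    + b'</body></html>'
)

encoding = sniff_meta_charset(page)
text = page.decode(encoding)
print(encoding)
```

The point of the sketch is the order of operations: the declaration is read from the bytes first, and only then is the body decoded. With r.text that decision has already been made (wrongly) before the parser ever sees the page.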