Irm*_*nis 3 python beautifulsoup python-requests
I noticed that when I fetch HTML from the web and parse it with Beautiful Soup, the text comes out altered. This is the code I use to fetch it:
from bs4 import BeautifulSoup
import requests
url ="http://www.basketnews.lt/lygos/59-nacionaline-krepsinio-asociacija/2013/naujienos.html"
r = requests.get(url)
soup = BeautifulSoup(r.text)
print soup
This is part of the original HTML:
<a href="/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html">Valančiūnui ir „Raptors“ sezonas baigtas <span class="title_description">(foto, statistika)</span></a>
And here is the same part of the HTML as returned by Beautiful Soup:
<a href="/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html">ValanÄiÅ«nui ir âRaptorsâ sezonas baigtas <span class="title_description">(foto, statistika)</span></a>
As you can see, the text in the parsed HTML is garbled. Where is the problem?
You are using r.text, which means requests decodes the body using its default encoding; in this case that encoding is wrong:
>>> r = requests.get(url)
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
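ISO-8859-1 (Latin 1) is the HTTP 1.1 default encoding for text/ responses, and requests falls back to it here because the Content-Type header carries no charset parameter.
When the detection algorithm is applied to the body instead (r.apparent_encoding), it finds UTF-8.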
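The mismatch described above can be reproduced without any network access. The following is a minimal sketch, not the actual code inside requests: `charset_from_content_type` is a hypothetical helper that mimics the described fallback (use the header's charset if present, otherwise ISO-8859-1 for text/ types), and the sample string is taken from the link text in the question.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Hypothetical helper: choose a charset from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return str(charset)
    # No charset parameter: assume the HTTP 1.1 default for text/ types.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


# The server sends UTF-8 bytes, regardless of what the header implies.
body = "Valančiūnui ir „Raptors“".encode("utf-8")

guessed = charset_from_content_type("text/html")
wrong = body.decode(guessed)   # roughly what r.text did in the question
right = body.decode("utf-8")   # what the page actually contains

# ascii() escapes non-ASCII characters so the output is safe on any terminal.
print(guessed)
print(ascii(wrong))
print(ascii(right))
```

Each multi-byte UTF-8 character turns into two or three Latin 1 characters, which is the kind of garbling shown in the question.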
You should not use r.text; use r.content instead and leave the encoding detection to BeautifulSoup:
soup = BeautifulSoup(r.content)
Now it works correctly:
>>> soup = BeautifulSoup(r.content)
>>> soup.find('a', href='/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html')
<a href="/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html">Valančiūnui ir „Raptors“ sezonas baigtas <span class="title_description">(foto, statistika)</span></a>
>>> print soup.find('a', href='/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html').text
Valančiūnui ir „Raptors“ sezonas baigtas (foto, statistika)
BeautifulSoup also uses auto-detection, but in this case it finds a <meta> header in the page that declares the correct encoding:
>>> soup.find('meta', {'http-equiv': 'content-type'})
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
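To illustrate the idea of reading the encoding from the document itself, here is a rough sketch using only the standard library. It is not how BeautifulSoup implements detection; `MetaCharsetFinder` and `sniff_meta_charset` are hypothetical names, and the sample markup is a shortened stand-in for the real page.

```python
import re
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetFinder(HTMLParser):
    """Hypothetical helper: record the first charset declared in a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # HTML5 form: <meta charset="...">
            self.charset = attr["charset"]
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="content-type" content="...; charset=...">
            match = re.search(r"charset=([\w-]+)", attr.get("content") or "", re.I)
            if match:
                self.charset = match.group(1)


def sniff_meta_charset(raw: bytes) -> Optional[str]:
    """Scan raw bytes for a <meta> charset declaration before decoding them."""
    finder = MetaCharsetFinder()
    # Tag and attribute names are ASCII, so a lossy ASCII decode suffices for the scan.
    finder.feed(raw.decode("ascii", errors="replace"))
    return finder.charset


raw = (
    '<html><head><meta http-equiv="content-type" '
    'content="text/html; charset=utf-8"/></head>'
    "<body><a>Valančiūnui</a></body></html>"
).encode("utf-8")

declared = sniff_meta_charset(raw)
text = raw.decode(declared or "utf-8")  # falling back to UTF-8 is an assumption
print(declared)
```

The point of the sketch is that the declaration can only be honoured if the parser receives the undecoded bytes, which is why r.content works where r.text does not.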