How can I check whether a page is an HTML page in Python?

Question

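I'm trying to write a web crawler in Python. I want to check whether the page I'm about to crawl is an HTML page rather than a .pdf/.doc/.docx file or similar. I don't want to check for the .html extension, because pages ending in .asp or .aspx, or URLs like http://bing.com/travel/, have no explicit .html extension but are still HTML pages. Is there a good way to do this in Python?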

Answer 1

unu*_*tbu 5

This fetches only the headers from the server:

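import urllib2  # Python 2 only; in Python 3 this functionality lives in urllib.request

url = 'http://www.kernel.org/pub/linux/kernel/v3.0/testing/linux-3.7-rc6.tar.bz2'
req = urllib2.Request(url)
# Override the request method so that only the headers are requested, not the body
req.get_method = lambda: 'HEAD'
response = urllib2.urlopen(req)
content_type = response.headers.getheader('Content-Type')
print(content_type)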


which prints

application/x-bzip2


From this you can conclude that the resource is not HTML. You can use

'html' in content_type


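to test programmatically whether the content is HTML (or possibly XHTML). If you want to be more certain that the content is HTML, you could download it and try to parse it with an HTML parser such as lxml or BeautifulSoup.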
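For Python 3, where urllib2 no longer exists, the HEAD request and the content-type test might be combined roughly as follows. This is a minimal sketch, assuming Python 3.3 or later; the names `is_html`, `fetch_content_type`, `HTML_MEDIA_TYPES` and the `timeout` parameter are illustrative and not part of the original answer. It also compares the media type exactly instead of using the substring test, which is a stricter choice than the one suggested above.

```python
import urllib.request

# Media types treated as HTML in this sketch (an assumption; extend as needed)
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type):
    """Return True if a Content-Type header value names an HTML or XHTML type."""
    if not content_type:
        # The header may be missing entirely
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

A crawler could then call `is_html(fetch_content_type(url))` before deciding whether to download a page. Servers can send a wrong or missing Content-Type, so the header is a hint and not a guarantee.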

Be careful about using requests.get like this:

import requests
r = requests.get(url)
print(r.headers['content-type'])


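This takes a long time, and my network monitor shows a sustained load, which leads me to believe that it downloads the entire file rather than just the headers.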

On the other hand,

import requests
r = requests.head(url)
print(r.headers['content-type'])


fetches only the headers.
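Some servers reject HEAD requests, so a crawler may want a fallback. The sketch below tries `requests.head` first and, if the server answers 405 or 501, falls back to `requests.get` with `stream=True`, which defers downloading the body until it is actually read. The function name `content_type_of`, the `http` parameter and the choice of status codes are assumptions made for illustration, not part of the original answer; `http` exists only so that a stand-in object can replace the requests module.

```python
def content_type_of(url, http=None, timeout=10):
    """Return the Content-Type header of url without downloading the body."""
    if http is None:
        # Third-party library; imported lazily so a stand-in can be passed instead
        import requests
        http = requests

    # requests.head does not follow redirects unless asked to
    response = http.head(url, allow_redirects=True, timeout=timeout)
    if response.status_code in (405, 501):
        # HEAD is not supported here: fall back to GET, but with stream=True
        # so that only the headers are fetched until the body is read
        response.close()
        response = http.get(url, stream=True, timeout=timeout)

    content_type = response.headers.get("Content-Type")
    # Close without reading the body, which releases the connection
    response.close()
    return content_type
```

Called as `content_type_of(url)`, this returns a string such as the `application/x-bzip2` value shown earlier, or `None` when the server sends no Content-Type header.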
