This is a fail-safe approach. First, parse the URL to get the domain and the rest of the URL.
>>> from urllib.parse import urlparse
>>> url = 'http://example.com/random/folder/path.html'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'example.com'
>>> parse_object.path
'/random/folder/path.html'
>>> parse_object.scheme
'http'
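One detail the session above leaves out is that `path` alone drops any query string, and an empty path has to be sent as `/`. The following is a minimal sketch of how the pieces could be assembled for a request; the helper name `split_for_request` is hypothetical, and it uses `urlsplit` (a sibling of `urlparse` in the same module) so that the path is kept intact.

```python
from urllib.parse import urlsplit


def split_for_request(url):
    """Return (scheme, host, port, target) for an absolute http(s) URL.

    Hypothetical helper for illustration; not part of the original answer.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("expected an absolute http(s) URL: %r" % url)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # An empty path is sent as "/", and the query string is part of the target.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.scheme, parts.hostname, port, target


print(split_for_request("http://example.com/random/folder/path.html"))
```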
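Now use that information to get the content type. In the session below, use parse_object.netloc in place of sstatic.net and parse_object.path in place of the hard-coded path. (This session and the ones after it use Python 2's httplib and print statement; in Python 3 the module is http.client and print is a function.)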
>>> import httplib
>>> conn = httplib.HTTPConnection("sstatic.net")
>>> conn.request("HEAD", "/stackoverflow/img/favicon.ico")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('content-length', '1150'), ('x-powered-by', 'ASP.NET'), ('accept-ranges', 'bytes'), ('last-modified', 'Mon, 02 Aug 2010 06:04:04 GMT'), ('etag', '"2187d82832cb1:0"'), ('cache-control', 'max-age=604800'), ('date', 'Sun, 12 Sep 2010 13:39:26 GMT'), ('content-type', 'image/x-icon')]
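This tells you that the resource is a 1150-byte image (an image/* MIME type). That is enough information to decide whether you want to fetch the full resource.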
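For Python 3, the same check might look roughly like the sketch below. The function names `head` and `is_small_image` and the `max_bytes` threshold are assumptions made for illustration, not part of the original answer. Note that some servers omit `content-length` or do not answer HEAD requests at all, so this sketch treats a missing size as "unknown" rather than "small".

```python
import http.client


def head(host, target, use_https=False, timeout=10):
    """Send a HEAD request; return (status, headers) with lowercased names."""
    conn_cls = (http.client.HTTPSConnection if use_https
                else http.client.HTTPConnection)
    conn = conn_cls(host, timeout=timeout)
    try:
        conn.request("HEAD", target)
        res = conn.getresponse()
        # Lowercase the names so lookups do not depend on the server's casing.
        headers = {name.lower(): value for name, value in res.getheaders()}
        return res.status, headers
    finally:
        conn.close()


def is_small_image(headers, max_bytes=1_000_000):
    """Return True if the headers describe an image no larger than max_bytes.

    max_bytes is an arbitrary illustrative limit.
    """
    content_type = headers.get("content-type", "").lower()
    if not content_type.startswith("image/"):
        return False
    length = headers.get("content-length")
    # Without a numeric content-length the size is unknown, so do not accept.
    return length is not None and length.isdigit() and int(length) <= max_bytes


# Example call (needs network access, so it is left commented out):
# status, headers = head("sstatic.net", "/stackoverflow/img/favicon.ico")
# print(is_small_image(headers))
```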
Edit
For shortened URLs, such as http://goo.gl/IwruD, which points to http://ubuntu.icafebusiness.com/images/ubuntugui2.jpg, the response you receive contains an additional header named 'location'.
Here is what I mean:
>>> import httplib
>>> conn = httplib.HTTPConnection("goo.gl")
>>> conn.request("HEAD", "/IwruD")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('x-xss-protection', '1; mode=block'),
('x-content-type-options', 'nosniff'),
('transfer-encoding', 'chunked'),
('age', '64'),
('expires', 'Mon, 01 Jan 1990 00:00:00 GMT'),
('server', 'GSE'),
('location', 'http://ubuntu.icafebusiness.com/images/ubuntugui2.jpg'),
('pragma', 'no-cache'),
('cache-control', 'no-cache, no-store, max-age=0, must-revalidate'),
('date', 'Sat, 30 Jun 2012 08:52:15 GMT'),
('x-frame-options', 'SAMEORIGIN'),
('content-type', 'text/html; charset=UTF-8')]
With a direct URL, you will not find that header.
>>> import httplib
>>> conn = httplib.HTTPConnection("ubuntu.icafebusiness.com")
>>> conn.request("HEAD", "/images/ubuntugui2.jpg")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('content-length', '78603'), ('accept-ranges', 'bytes'), ('server', 'Apache'), ('last-modified', 'Sat, 16 Aug 2008 01:36:17 GMT'), ('etag', '"1fb8277-1330b-45489c3ad2640"'), ('date', 'Sat, 30 Jun 2012 08:55:46 GMT'), ('content-type', 'image/jpeg')]
You can check for it with some simple code:
>>> # res is the response to the HEAD request for goo.gl/IwruD shown earlier
>>> r = res.getheaders()
>>> redirected = False
>>> for e in r:
...     if e[0] == 'location':
...         redirected = e
...
>>> if redirected:
...     print redirected[1]
...
http://ubuntu.icafebusiness.com/images/ubuntugui2.jpg
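The loop above relies on header names being lowercase and on the location value being an absolute URL. A slightly more defensive Python 3 sketch, under the assumption that you also have the response status at hand, could compare names case-insensitively and resolve relative targets with `urljoin`. The name `redirect_target` and the set of status codes treated as redirects are choices made for this example.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(request_url, status, headers):
    """Return the absolute redirect URL, or None if this is not a redirect.

    headers is a list of (name, value) tuples, as returned by getheaders().
    """
    if status not in REDIRECT_STATUSES:
        return None
    for name, value in headers:
        if name.lower() == "location":
            # A relative Location is resolved against the requested URL.
            return urljoin(request_url, value)
    return None
```

If the function returns a URL, you can repeat the HEAD request against it to read the content type of the final resource; capping the number of hops is a sensible guard against redirect loops.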