Python script to check whether a web page exists without downloading the whole page?

som*_*me1 16 python httplib urlparse

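I'm trying to write a script that tests whether a web page exists, and it would be nice if it could check without downloading the whole page.

This is my jumping-off point. I've seen several examples use httplib the same way, but every site I check returns False.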

import httplib
from httplib import HTTP
from urlparse import urlparse

def checkUrl(url):
    p = urlparse(url)
    h = HTTP(p[1])
    h.putrequest('HEAD', p[2])
    h.endheaders()
    return h.getreply()[0] == httplib.OK

if __name__=="__main__":
    print checkUrl("http://www.stackoverflow.com") # True
    print checkUrl("http://stackoverflow.com/notarealpage.html") # False

Any ideas?

Edit

Someone suggested this, but their post was deleted. Does urllib2 avoid downloading the whole page?

import urllib2

def checkUrl(some_url):
    try:
        urllib2.urlopen(some_url)
        return True
    except urllib2.URLError:
        return False

Cor*_*erg 22

How about this:

import httplib
from urlparse import urlparse

def checkUrl(url):
    p = urlparse(url)
    conn = httplib.HTTPConnection(p.netloc)
    conn.request('HEAD', p.path)
    resp = conn.getresponse()
    return resp.status < 400

if __name__ == '__main__':
    print checkUrl('http://www.stackoverflow.com') # True
    print checkUrl('http://stackoverflow.com/notarealpage.html') # False

This sends an HTTP HEAD request and returns True if the response status code is < 400.

  • Note that StackOverflow's root path returns a redirect (301), not 200 OK.

  • Changes are required for Python 3: import urllib.parse as urlparse and import httplib2. Use HTTPConnectionWithTimeout instead of HTTPConnection, and urlparse.urlparse instead of urlparse. (3 upvotes)
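
For Python 3 without third-party packages, here is a minimal sketch of the same HEAD check using the standard library's http.client and urllib.parse (the Python 3 names for httplib and urlparse). The function name check_url, the timeout parameter, the https handling, and the choice to return False on connection errors are my own assumptions and are not part of the answer above.

```python
import http.client
from urllib.parse import urlparse


def check_url(url, timeout=10):
    """Return True if a HEAD request for url yields a status code below 400."""
    p = urlparse(url)
    # Assumption: pick the connection class from the URL scheme.
    if p.scheme == 'https':
        conn = http.client.HTTPSConnection(p.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(p.netloc, timeout=timeout)
    try:
        # An empty path means the site root; keep any query string.
        path = p.path or '/'
        if p.query:
            path += '?' + p.query
        conn.request('HEAD', path)
        # Redirects (3xx) are not followed and count as "exists" here.
        return conn.getresponse().status < 400
    except (OSError, http.client.HTTPException):
        # Assumption: DNS failures, refused connections, and timeouts
        # are treated as "page does not exist".
        return False
    finally:
        conn.close()
```

It is called the same way as the Python 2 version, e.g. check_url('http://stackoverflow.com/notarealpage.html'). The one behavioural difference is that network failures come back as False instead of raising.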

Max*_*Noe 11

With requests, this is simple:

import requests

ret = requests.head('http://www.example.com')
print(ret.status_code)

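This only loads the site's headers. To test whether the request succeeded, you can check the resulting status_code, or call the raise_for_status method, which raises an exception if the response status indicates an error (4xx or 5xx).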
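
The raise_for_status route could look roughly like the sketch below. The function name page_exists and the timeout value are assumptions of mine. Note that requests.head does not follow redirects by default, so a 301 is reported as-is and counts as success here, because raise_for_status only raises for 4xx and 5xx.

```python
import requests


def page_exists(url, timeout=10):
    """Return True if a HEAD request for url succeeds with a non-error status."""
    try:
        # HEAD asks the server for headers only, not the page body.
        resp = requests.head(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        resp.raise_for_status()
        return True
    except requests.RequestException:
        # Covers HTTPError as well as connection errors and timeouts.
        return False
```

If you want a redirect to be resolved before deciding, passing allow_redirects=True to requests.head is one option.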


小智 5

How about this.

import requests

def url_check(url):
    """Return True if the site exists, False otherwise.

    Requests only the headers of the URL (HEAD), not the full HTML,
    and returns True if the response status code is less than 400.
    """
    try:
        site_ping = requests.head(url)
        # To view the returned status code: print(site_ping.status_code)
        return site_ping.status_code < 400
    except requests.RequestException:
        # Connection errors, timeouts, invalid URLs, etc.
        return False