Python - Check whether a request received the whole page

Hyp*_*ion 1 python beautifulsoup python-requests

I use this function in my script to request the BeautifulSoup object for a web page:

import time

import requests
from bs4 import BeautifulSoup

def getSoup(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'
    }
    i = 0
    while i == 0:
        # getTime() is a helper defined elsewhere in the script
        print('(%s) (INFO) Connecting to: %s ...' % (getTime(), url))
        data = requests.get(url, headers=headers).text
        soup = BeautifulSoup(data, 'lxml')
        if soup is None:
            print('(%s) (WARN) Received \'None\' BeautifulSoup object, retrying in 5 seconds ...' % getTime())
            time.sleep(5)
        else:
            i = 1
    return soup

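This loops until I receive a valid BeautifulSoup object, but I suspect I could also receive an incomplete web page and still end up with a valid BeautifulSoup object. I would like to use something like: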

if '</html>' in str(data):
    # the page is completely loaded

But I don't know whether it is safe to use it this way. Is there a safe way to check that the page has been downloaded correctly, using requests or BeautifulSoup?

Jay*_*son 5

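One approach is to check the status code of the response and see whether you received a Partial Content response (206). See the list of standard HTTP response status codes for their definitions.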

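# url, headers and getTime() are the ones from the question
partial_data = ''  # holds the bodies of any 206 (Partial Content) responses

response = requests.get(url, headers=headers)
if response.status_code == 200:
    # requests exposes the decoded body as response.text
    soup = BeautifulSoup(response.text + partial_data, 'lxml')
    partial_data = ''
    if soup is None:
        print('(%s) (WARN) Received \'None\' BeautifulSoup object, retrying in 5 seconds ...' % getTime())
        time.sleep(5)
elif response.status_code == 206:
    # store partial data here
    partial_data += response.text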
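
The status-code check and the closing-tag check from the question can be combined into one small helper. The following is only a rough sketch, not a definitive test: the name `looks_complete` is hypothetical, and it assumes the page is an HTML document that ends with an explicit `</html>` tag. HTML allows that end tag to be omitted, so a missing tag does not prove the download was truncated, and a present tag does not prove every resource on the page loaded.

```python
def looks_complete(status_code, body_text):
    """Heuristic check that a response looks like a full HTML page.

    Assumptions (not from the original post): a full response has
    status 200 and the document ends with a closing </html> tag.
    """
    if status_code != 200:
        # 206 means Partial Content; other codes are errors or redirects
        return False
    # Compare case-insensitively, since tag names are not case-sensitive
    return '</html>' in body_text.lower()


# Possible use with requests (url and headers as in the question):
#     response = requests.get(url, headers=headers)
#     if looks_complete(response.status_code, response.text):
#         soup = BeautifulSoup(response.text, 'lxml')
```

Keeping the check as a plain function of the status code and the body text means it can be tried on saved pages without making any network request.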