Detecting whether a web page has changed

Rov*_*Dar 8 python screen-scraping if-modified-since web

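In my Python application I have to read many web pages to collect data. To reduce the number of HTTP calls, I want to fetch only the pages that have changed. My problem is that my code always tells me the page has changed (status code 200), even when it has not.

This is my code: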

from models import mytab
import re
import urllib2
from wsgiref.handlers import format_date_time
from datetime import datetime
from time import mktime

def url_change():
    urls = mytab.objects.all()
    # Some example URLs:
    # http://www.venere.com/it/pensioni/venezia/pensione-palazzo-guardi/#reviews
    # http://www.zoover.it/italia/sardegna/cala-gonone/san-francisco/hotel
    # http://www.orbitz.com/hotel/Italy/Venice/Palazzo_Guardi.h161844/#reviews
    # http://it.hotels.com/ho292636/casa-del-miele-susegana-italia/
    # http://www.expedia.it/Venezia-Hotel-Palazzo-Guardi.h1040663.Hotel-Information#reviews
    # ...

    for url in urls:
        request = urllib2.Request(url.url)
        if url.last_date is None:
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()

        request.add_header("If-Modified-Since", url.last_date)

        try:
            response = urllib2.urlopen(request) # Make the request
            # some actions
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()
        except urllib2.HTTPError, err:
            if err.code == 304:
                print "nothing...."
            else:
                print "Error code:", err.code

I don't understand what is wrong. Can anyone help me?

Jon*_*sco 5

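A web server is not required to answer a request carrying an "If-Modified-Since" header with a 304. It is free to send HTTP 200 and the entire page again.

Sending "If-Modified-Since" or "If-None-Match" tells the server that you would accept a "not modified" response in place of the page, if it can give one. It is like sending an 'Accept-Encoding: gzip, deflate' header: you are only telling the server what you will accept, not requiring it.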
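As a rough illustration of the point above, here is a minimal sketch of a conditional GET that treats both outcomes as normal. It is written for Python 3 (where `urllib2` became `urllib.request`), and the names `build_conditional_request` and `fetch_if_changed` are hypothetical helpers, not part of the question's code. It also assumes that the validators sent back are the `Last-Modified` and `ETag` values the server itself returned on the previous fetch, rather than the local clock time used in the question; servers that do support 304 compare against their own values.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None, etag=None):
    """Build a GET request carrying whichever stored validators we have."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    if etag:
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(url, last_modified=None, etag=None):
    """Return (got_body, body, last_modified, etag).

    got_body is False only when the server answered 304. A 200 means the
    server sent the page again, which may or may not mean it changed.
    """
    request = build_conditional_request(url, last_modified, etag)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read()
            # Store the server's own validators for the next request.
            return (True, body,
                    response.headers.get("Last-Modified"),
                    response.headers.get("ETag"))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep the validators we already had.
            return False, None, last_modified, etag
        raise
```

If a site never sends `Last-Modified` or `ETag`, the request goes out unconditionally and a 200 is the only possible success, so some content comparison on the client side is still needed.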

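  • The simplest approach is to fingerprint each page with an MD5 hash and store it locally for comparison. The problem is that the "main" content may be unchanged while the "ancillary" content has changed: different ad tags, "promoted stories", "referral links", "partner links", and so on. Even a timestamp on the page will throw off the MD5. (3 upvotes)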
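One way to work around the ancillary-content problem described in the comment above is to hash only the part of the page you care about. The sketch below is an assumption-laden illustration, not a general solution: it assumes the interesting content lives inside a single element with a known `id` (the example uses a hypothetical `reviews` id), that tags inside it are properly closed, and it uses SHA-256 from `hashlib` in place of MD5 (either serves for change detection). `RegionText` and `region_fingerprint` are made-up names.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RegionText(HTMLParser):
    """Collect the text inside the first element whose id matches."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0        # > 0 while inside the target element
        self.done = False     # True once the target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def region_fingerprint(html, element_id):
    """Hash the whitespace-normalized text of one region of the page."""
    parser = RegionText(element_id)
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Storing `region_fingerprint(page, "reviews")` per URL and comparing it on the next fetch ignores changes in ads or timestamps outside that element. Real pages often need a more tolerant parser and a per-site choice of region.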