roc*_*wse 7 python html5 parsing
我试图看看HTML5应用程序是如何工作的,任何在webkit浏览器中保存页面的尝试(chrome,Safari)都包含一些但不是所有的cache.manifest资源.是否有一个库或一组代码将解析cache.manifest文件,并下载所有资源(图像,脚本,CSS)?
(原始代码移动回答... noob错误>.<)
我最初将其作为问题的一部分发布...(没有新手 stackoverflow 海报曾经这样做过;)
因为根本没有答案。干得好:
我能够想出以下 python 脚本来执行此操作,但任何输入都会受到赞赏 =) (这是我第一次尝试 python 代码,所以可能有更好的方法)
import os
import urllib2
import urllib
cmServerURL = 'http://<serverURL>:<port>/<path-to-cache.manifest>'
# download file code taken from stackoverflow
# http://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python
def loadURL(url, dirToSave):
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(dirToSave, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
file_size_dl = 0
block_sz = 8192
while True:
buffer = u.read(block_sz)
if not buffer:
break
file_size_dl += len(buffer)
f.write(buffer)
status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
status = status + chr(8)*(len(status)+1)
print status,
f.close()
# download the cache.manifest file
# since this request doesn't include the Conent-Length header we will use a different api =P
urllib.urlretrieve (cmServerURL+ 'cache.manifest', './cache.manifest')
# open the cache.manifest and go through line-by-line checking for the existance of files
f = open('cache.manifest', 'r')
for line in f:
filepath = line.split('/')
if len(filepath) > 1:
fileName = line.strip()
# if the file doesn't exist, lets download it
if not os.path.exists(fileName):
print 'NOT FOUND: ' + line
dirName = os.path.dirname(fileName)
print 'checking dirctory: ' + dirName
if not os.path.exists(dirName):
os.makedirs(dirName)
else:
print 'directory exists'
print 'downloading file: ' + cmServerURL + line,
loadURL (cmServerURL+fileName, fileName)
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
505 次 |
| 最近记录: |