5w0*_*1sh 5 python content-type http-headers python-requests
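I have a simple website crawler. It works fine, but sometimes it gets stuck on large content such as ISO images, .exe files and other big downloads. Guessing the content type from the file extension is probably not the best idea.
Is it possible to get the content type and the content length/size without fetching the whole content/page?
Here is my code: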
requests.adapters.DEFAULT_RETRIES = 2
url = url.decode('utf8', 'ignore')
urlData = urlparse.urlparse(url)
urlDomain = urlData.netloc
session = requests.Session()
customHeaders = {}
if maxRedirects is None:
    session.max_redirects = self.maxRedirects
else:
    session.max_redirects = maxRedirects
self.currentUserAgent = self.userAgents[random.randrange(len(self.userAgents))]
customHeaders['User-agent'] = self.currentUserAgent
try:
    response = session.get(url, timeout=self.pageOpenTimeout, headers=customHeaders)
    currentUrl = response.url
    currentUrlData = urlparse.urlparse(currentUrl)
    currentUrlDomain = currentUrlData.netloc
    domainWWW = 'www.' + str(urlDomain)
    headers = response.headers
    contentType = str(headers['content-type'])
except:
    logging.basicConfig(level=logging.DEBUG, filename=self.exceptionsFile)
    logging.exception("Get page exception:")
    response = None
Yes.
You can use the Session.head method to make a HEAD request:
response = session.head(url, timeout=self.pageOpenTimeout, headers=customHeaders)
contentType = response.headers['content-type']
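A HEAD request is like a GET request, except that the message body is not sent.
Here is a quote from Wikipedia:
HEAD asks for a response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.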
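To make the HEAD approach a bit more concrete, here is a minimal sketch of a pre-check a crawler could run before downloading. The helper names `parse_content_info` and `probe` are made up for illustration, and `session` is assumed to be a `requests.Session()` (any object with a compatible `head()` method works). Note that `head()` in requests does not follow redirects unless you pass `allow_redirects=True`, and that a server may omit Content-Length or not support HEAD properly, so treat the result as a hint rather than a guarantee.

```python
def parse_content_info(headers):
    """Return (media_type, length) from a mapping of response headers.

    media_type is lower-cased with parameters such as charset stripped;
    length is an int, or None if Content-Length is missing or malformed.
    """
    raw_type = headers.get('content-type') or ''
    media_type = raw_type.split(';', 1)[0].strip().lower()
    try:
        length = int(headers.get('content-length'))
    except (TypeError, ValueError):
        length = None
    return media_type, length


def probe(session, url, timeout=10):
    """Send a HEAD request and return (media_type, length) for the final URL."""
    # head() does not follow redirects by default, so ask for it explicitly.
    response = session.head(url, timeout=timeout, allow_redirects=True)
    return parse_content_info(response.headers)
```

With a real session you would call something like `media_type, length = probe(requests.Session(), url)` and skip the URL when `media_type` is not an HTML type or `length` is above your limit.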
Use requests.head() for this. It does not return the message body. You should use the head method if you are only interested in the headers. See this link for details.
h = requests.head(some_link)
header = h.headers
content_type = header.get('content-type')
Sorry, my mistake, I should have read the documentation more carefully. Here is the answer: http://docs.python-requests.org/en/latest/user/advanced/#advanced (Body Content Workflow)
tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, stream=True)
if int(r.headers['content-length']) > TOO_LONG:
    r.connection.close()
    # log request too long
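One caveat with the snippet above: Content-Length is not always present (for example with chunked responses), so `r.headers['content-length']` can raise a KeyError. A rough sketch of a more defensive variant is below. It checks the declared length when there is one and also caps the number of bytes actually read. The names `read_capped` and `fetch_capped` and the 8192-byte chunk size are my own choices for illustration, and `session` is assumed to be a `requests.Session()`.

```python
def read_capped(chunks, max_bytes):
    """Join byte chunks, or return None once the total exceeds max_bytes."""
    parts = []
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            return None
        parts.append(chunk)
    return b''.join(parts)


def fetch_capped(session, url, max_bytes, timeout=10):
    """GET url with stream=True, giving up if the body exceeds max_bytes."""
    # With stream=True only the headers are fetched until the body is read.
    response = session.get(url, stream=True, timeout=timeout)
    try:
        declared = response.headers.get('content-length')
        if declared is not None and declared.isdigit() and int(declared) > max_bytes:
            return None  # the server says it is too big, so do not read the body
        # No usable Content-Length: read in chunks and stop at the cap.
        return read_capped(response.iter_content(chunk_size=8192), max_bytes)
    finally:
        response.close()
```

`fetch_capped` returns the body as bytes, or None when the content is too large, so the crawler can log the URL and move on.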