Read timeout using urllib2 or any other HTTP library

Question

Bjö*_*ist 25 python sockets timeout http nonblocking

I have code that reads a URL like this:

from urllib2 import Request, urlopen
req = Request(url)
for key, val in headers.items():
    req.add_header(key, val)
res = urlopen(req, timeout = timeout)
# This line blocks
content = res.read()

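The timeout works for the urlopen() call. But then the code reaches the res.read() call, where I want to read the response data, and the timeout is not applied there. So the read call may hang almost forever waiting for data from the server. The only solution I have found is to use a signal to interrupt read(), which does not suit me because I am using threads.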

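What other options are there? Is there an HTTP library for Python that handles read timeouts? I have looked at httplib2 and requests, and they seem to suffer from the same problem as above. I don't want to write my own non-blocking network code with the socket module, because I think there should already be a library for this.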

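Update: None of the solutions below work for me. You can see for yourself that setting the socket or urlopen timeout has no effect when downloading a large file: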

from urllib2 import urlopen
url = 'http://iso.linuxquestions.org/download/388/7163/http/se.releases.ubuntu.com/ubuntu-12.04.3-desktop-i386.iso'
c = urlopen(url)
c.read()

At least on Windows with Python 2.7.3, the timeouts are completely ignored.

Answer 1

小智 6

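No library can do this without some kind of asynchronous timer, implemented with threads or otherwise. The reason is that the timeout parameter used by httplib, urllib2 and other libraries sets the timeout on the underlying socket. What that actually does is explained in the documentation: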

SO_RCVTIMEO

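Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. If a receive operation has blocked for this much time without receiving additional data, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.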

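The part about blocking "without receiving additional data" is the key. A socket.timeout is raised only if not a single byte has been received for the duration of the timeout window. In other words, this is a timeout between received bytes.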
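
The per-byte behaviour can be reproduced locally without any HTTP server. This is a minimal sketch in Python 3 (an assumption; the thread itself uses Python 2), and `trickle`, `read_until_eof` and `demo` are hypothetical helper names. A sender thread dribbles out one byte every 0.3 seconds, and the reader uses a 1.0 second socket timeout. The whole read takes about 1.5 seconds, yet no timeout is raised, because no single gap between bytes exceeds the limit.

```python
import socket
import threading
import time


def trickle(sock, count, delay):
    """Send `count` single bytes, sleeping `delay` seconds before each one."""
    try:
        for _ in range(count):
            time.sleep(delay)
            sock.sendall(b"x")
    finally:
        sock.close()  # closing signals EOF to the reader


def read_until_eof(sock, timeout):
    """Read until EOF with a per-recv() timeout; return (data, seconds elapsed)."""
    sock.settimeout(timeout)
    start = time.monotonic()
    chunks = []
    while True:
        # Raises socket.timeout only if nothing arrives for `timeout` seconds.
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks), time.monotonic() - start


def demo(count=5, delay=0.3, timeout=1.0):
    """Run the slow sender against the reader over a local socket pair."""
    reader, writer = socket.socketpair()
    sender = threading.Thread(target=trickle, args=(writer, count, delay))
    sender.start()
    try:
        return read_until_eof(reader, timeout)
    finally:
        sender.join()
        reader.close()
```

Calling `demo()` returns all five bytes together with an elapsed time that is longer than the socket timeout, which is exactly the situation a slow server creates for `res.read()`.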

A simple function using threading.Timer could look like this:

import httplib
import socket
import threading

def download(host, path, timeout = 10):
    content = None

    http = httplib.HTTPConnection(host)
    http.request('GET', path)
    response = http.getresponse()

    timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
    timer.start()

    try:
        content = response.read()
    except httplib.IncompleteRead:
        pass

    timer.cancel() # cancel on triggered Timer is safe
    http.close()

    return content

>>> host = 'releases.ubuntu.com'
>>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
>>> print content is None
True
>>> content = download(host, '/15.04/MD5SUMS', 1)
>>> print content is None
False

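Instead of checking for None, it is also possible to catch the httplib.IncompleteRead exception outside the function rather than inside it. That approach will not work, however, if the HTTP response has no Content-Length header.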

Answer 2

Alf*_*lfe 5

In my tests (using the technique described here), I found that a timeout set in the urlopen() call also affects the read() call:

import urllib2 as u
c = u.urlopen('http://localhost/', timeout=5.0)
s = c.read(1<<20)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 1298, in read
    return s + self._file.read(amt - len(s))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
socket.timeout: timed out

Maybe this is a feature of newer versions? I am using Python 2.7 on an out-of-the-box Ubuntu 12.04.

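It may trigger the timeout for individual `.recv()` calls (which may return partial data), but [it does not limit the total read timeout (until EOF)](http://stackoverflow.com/a/32684677/4279). (3 upvotes)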
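
To bound the total time rather than the gap between bytes, one option is to read in chunks and check a deadline between reads. The following is a minimal sketch in Python 3 (an assumption; the thread itself uses Python 2). `read_with_deadline` and `ReadDeadlineExceeded` are hypothetical names, and `stream` is any object with a `read(n)` method, such as the response returned by `urlopen()`. The deadline is checked only between reads, so a socket timeout is still needed to keep any single read from blocking indefinitely.

```python
import time


class ReadDeadlineExceeded(Exception):
    """Raised when the whole body was not read within the deadline."""


def read_with_deadline(stream, max_seconds, chunk_size=8192):
    """Read `stream` to EOF, giving up once `max_seconds` have elapsed in total."""
    deadline = time.monotonic() + max_seconds
    body = bytearray()
    while True:
        if time.monotonic() > deadline:
            raise ReadDeadlineExceeded(
                "gave up after reading %d bytes" % len(body))
        # Each read may itself block for up to the socket timeout.
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(body)
        body.extend(chunk)
```

With a real connection this would be used along the lines of `read_with_deadline(urllib.request.urlopen(url, timeout=10), 60)`. The worst-case total is then roughly the deadline plus the time one read takes, rather than unbounded.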

Answer 3

kol*_*nko 5

I would expect this to be a common problem, yet I could not find an answer anywhere... I just built a solution for this using a timeout signal:

import signal
import socket
import urllib2

timeout = 10
socket.setdefaulttimeout(timeout)

def timeout_catcher(signum, _):
    raise urllib2.URLError("Read timeout")

# SIGALRM is Unix-only, and handlers can only be installed from the main thread.
signal.signal(signal.SIGALRM, timeout_catcher)

def safe_read(url, timeout_time):
    # Arm a wall-clock timer that covers both urlopen() and read().
    signal.setitimer(signal.ITIMER_REAL, timeout_time)
    try:
        return urllib2.urlopen(url, timeout=timeout_time).read()
    finally:
        # Always disarm the timer, even when an exception propagates.
        signal.setitimer(signal.ITIMER_REAL, 0)

# Example: content = safe_read('http://uberdns.eu', timeout)

By the way, credit for the signal part of the solution goes here: python timer mystery

Answer 4

Chr*_*isP 0

This is not the behavior I see. I get a URLError when the call times out:

from urllib2 import Request, urlopen
req = Request('http://www.google.com')
res = urlopen(req,timeout=0.000001)
#  Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  ...
#  raise URLError(err)
#  urllib2.URLError: <urlopen error timed out>

Can't you catch this error and then avoid trying to read res? When I try to use res.read() after this, I get NameError: name 'res' is not defined. Is something like this what you need?

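from urllib2 import URLError

try:
    res = urlopen(req, timeout=3.0)
except URLError:
    print 'Doh!'
else:
    # Only read when urlopen() succeeded, so res is guaranteed to exist.
    print 'yay!'
    print res.read()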

I suppose the way to implement a timeout manually is via multiprocessing, no? If the job has not finished, you can terminate it.
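
A rough sketch of that multiprocessing idea, in Python 3 (an assumption; the thread itself uses Python 2). `run_with_timeout` and `_call_and_send` are hypothetical names. The sketch assumes the "fork" start method, which exists only on Unix-like systems; with the spawn method used on Windows, `func` would have to be picklable and importable, and the calling code would need a `__main__` guard.

```python
import multiprocessing


def _call_and_send(conn, func, args):
    """Child-process entry point: run func and ship the outcome to the parent."""
    try:
        conn.send((True, func(*args)))
    except Exception as exc:
        conn.send((False, exc))
    finally:
        conn.close()


def run_with_timeout(func, args=(), timeout=5.0):
    """Run func(*args) in a child process; terminate it after `timeout` seconds."""
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like OS
    receiver, sender = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_call_and_send, args=(sender, func, args))
    proc.start()
    sender.close()  # the parent keeps only the receiving end
    try:
        if not receiver.poll(timeout):
            raise TimeoutError("no result within %.1f seconds" % timeout)
        ok, value = receiver.recv()
    finally:
        # Kill the child if it is still running, then reap it.
        if proc.is_alive():
            proc.terminate()
        proc.join()
        receiver.close()
    if ok:
        return value
    raise value  # re-raise the exception that occurred in the child
```

A download function that performs both `urlopen()` and `read()` could be passed as `func`, so the limit covers the whole transfer. The cost is one extra process per request, and the result has to be sent back through the pipe.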

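I think you misunderstood. The urlopen() call connects to the server successfully, but then the program hangs at the read() call because the server returns data slowly. That is where the timeout is needed. (5 upvotes)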

Archived: 13 years, 6 months ago
Views: 13,763
Last activity: 9 years, 9 months ago