Throttling/limiting the HTTP request rate in GRequests

Bar*_*wek 23 python throttling http rate-limiting python-requests

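I'm writing a small script in Python 2.7.3 with GRequests and lxml that will let me gather collectible card prices from various websites and compare them. The problem is that one of the websites limits the number of requests and sends back HTTP error 429 if I exceed it.

Is there a way to throttle the number of requests in GRequests so that I don't exceed the number of requests per second I specify? Also, if HTTP 429 occurs, how can I make GRequests retry after some time?

On a side note, their limit is ridiculously low: 8 requests per 15 seconds. I have breached it many times in my browser just by refreshing the page while waiting for price changes.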

Bar*_*wek 27

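I'm going to answer my own question, since I had to figure this out by myself and there seems to be very little information on it.

The idea is as follows. Every request object used with GRequests can take a session object as a parameter when it is created. Session objects, in turn, can have HTTP adapters mounted that are used when making requests. By creating our own adapter, we can intercept requests and rate-limit them in whatever way best suits our application. In my case I ended up with the code below.

The object used for throttling: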

import datetime

DEFAULT_BURST_WINDOW = datetime.timedelta(seconds=5)
DEFAULT_WAIT_WINDOW = datetime.timedelta(seconds=15)


class BurstThrottle(object):
    max_hits = None
    hits = None
    burst_window = None
    total_window = None
    timestamp = None

    def __init__(self, max_hits, burst_window, wait_window):
        self.max_hits = max_hits
        self.hits = 0
        self.burst_window = burst_window
        self.total_window = burst_window + wait_window
        self.timestamp = datetime.datetime.min

    def throttle(self):
        now = datetime.datetime.utcnow()
        if now < self.timestamp + self.total_window:
            if (now < self.timestamp + self.burst_window) and (self.hits < self.max_hits):
                self.hits += 1
                return datetime.timedelta(0)
            else:
                return self.timestamp + self.total_window - now
        else:
            self.timestamp = now
            self.hits = 1
            return datetime.timedelta(0)
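
To make the behaviour of the throttle easier to follow, here is a minimal sketch that replays a few calls against a fixed clock. It is a condensed copy of the class above; the `now` argument to `throttle()` is an assumption added purely so the example is deterministic and is not part of the original code.

```python
import datetime


class BurstThrottle(object):
    """Condensed copy of the throttle above, with an injectable clock."""

    def __init__(self, max_hits, burst_window, wait_window):
        self.max_hits = max_hits
        self.hits = 0
        self.burst_window = burst_window
        self.total_window = burst_window + wait_window
        self.timestamp = datetime.datetime.min

    def throttle(self, now=None):
        # `now` is a hypothetical parameter added for this demo only.
        if now is None:
            now = datetime.datetime.utcnow()
        if now < self.timestamp + self.total_window:
            in_burst = now < self.timestamp + self.burst_window
            if in_burst and self.hits < self.max_hits:
                self.hits += 1
                return datetime.timedelta(0)
            # Burst used up (or burst window over): wait for the window to end.
            return self.timestamp + self.total_window - now
        # Window expired: start a new one and count this call as its first hit.
        self.timestamp = now
        self.hits = 1
        return datetime.timedelta(0)


start = datetime.datetime(2020, 1, 1)
seconds = lambda n: datetime.timedelta(seconds=n)
throttle = BurstThrottle(2, burst_window=seconds(5), wait_window=seconds(15))

waits = [
    throttle.throttle(start),                # opens a window, first hit
    throttle.throttle(start + seconds(1)),   # second hit, still allowed
    throttle.throttle(start + seconds(2)),   # burst exhausted, must wait
    throttle.throttle(start + seconds(20)),  # window over, new window opens
]
```

With two hits allowed per window, the third call is told to wait until the 20-second window (5 s burst plus 15 s wait) has ended.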

The HTTP adapter:

import datetime

import gevent
import requests
import requests.adapters


class MyHttpAdapter(requests.adapters.HTTPAdapter):
    throttle = None

    def __init__(self, pool_connections=requests.adapters.DEFAULT_POOLSIZE,
                 pool_maxsize=requests.adapters.DEFAULT_POOLSIZE, max_retries=requests.adapters.DEFAULT_RETRIES,
                 pool_block=requests.adapters.DEFAULT_POOLBLOCK, burst_window=DEFAULT_BURST_WINDOW,
                 wait_window=DEFAULT_WAIT_WINDOW):
        self.throttle = BurstThrottle(pool_maxsize, burst_window, wait_window)
        super(MyHttpAdapter, self).__init__(pool_connections=pool_connections, pool_maxsize=pool_maxsize,
                                            max_retries=max_retries, pool_block=pool_block)

    def send(self, request, stream=False, timeout=None, verify=True, cert=None, proxies=None):
        request_successful = False
        response = None
        while not request_successful:
            wait_time = self.throttle.throttle()
            while wait_time > datetime.timedelta(0):
                gevent.sleep(wait_time.total_seconds(), ref=True)
                wait_time = self.throttle.throttle()

            response = super(MyHttpAdapter, self).send(request, stream=stream, timeout=timeout,
                                                       verify=verify, cert=cert, proxies=proxies)

            if response.status_code != 429:
                request_successful = True

        return response
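
Note that the `send` loop above retries a 429 indefinitely. If you would rather give up after a few attempts, something along these lines might work. This is only a sketch: `retry_delay` and `send_with_retries` are hypothetical helper names, and it assumes the server may send a `Retry-After` header with a number of seconds (the HTTP-date form of that header is not handled here).

```python
import time


def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Seconds to wait before retrying a request that got a 429."""
    value = headers.get('Retry-After')
    if value is not None:
        try:
            # Honour the server's hint when it is a plain number of seconds.
            return max(0.0, float(int(value)))
        except ValueError:
            pass  # e.g. an HTTP-date; fall back to backoff in this sketch
    # No usable hint: exponential backoff, capped at `cap` seconds.
    return min(base * (2 ** attempt), cap)


def send_with_retries(send_once, max_retries=5, sleep=time.sleep):
    """Call send_once() until it returns a non-429 response or retries run out."""
    response = None
    for attempt in range(max_retries + 1):
        response = send_once()
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            sleep(retry_delay(response.headers, attempt))
    return response  # still 429 after max_retries; the caller decides what to do
```

Inside a gevent-based adapter you would pass `sleep=gevent.sleep` so that other greenlets keep running while this one waits.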

Setup:

requests_adapter = adapter.MyHttpAdapter(
    pool_connections=__CONCURRENT_LIMIT__,
    pool_maxsize=__CONCURRENT_LIMIT__,
    max_retries=0,
    pool_block=False,
    burst_window=datetime.timedelta(seconds=5),
    wait_window=datetime.timedelta(seconds=20))

requests_session = requests.session()
requests_session.mount('http://', requests_adapter)
requests_session.mount('https://', requests_adapter)

unsent_requests = (grequests.get(url,
                                 hooks={'response': handle_response},
                                 session=requests_session) for url in urls)
grequests.map(unsent_requests, size=__CONCURRENT_LIMIT__)
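
The setup refers to a `handle_response` hook that is not shown. As a rough illustration only (the `results` dict and the body of the hook are assumptions, not the code actually used), a response hook could look like this. Requests calls response hooks with the response object plus keyword arguments:

```python
results = {}  # hypothetical store: URL -> page body (None on failure)


def handle_response(response, *args, **kwargs):
    # Keep the body of successful responses; mark everything else as failed.
    if response.status_code == 200:
        results[response.url] = response.text
    else:
        results[response.url] = None
```

The stored text could then be handed to lxml for parsing once `grequests.map` has finished.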


se7*_*7en 9

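Take a look at this for automatic request throttling: https://pypi.python.org/pypi/RequestsThrottler/0.2.2

You can set either a fixed delay between each request or a number of requests to send within a fixed number of seconds (which is basically the same thing):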

import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
with BaseThrottler(name='base-throttler', delay=1.5) as bt:
    throttled_requests = bt.multi_submit(reqs)

Here the function multi_submit returns a list of ThrottledRequest objects (see the documentation link at the end).

You can then access the responses:

for tr in throttled_requests:
    print tr.response

Alternatively, you can achieve the same by specifying the number of requests to send in a fixed amount of time (e.g. 15 requests every 60 seconds):

import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
with BaseThrottler(name='base-throttler', reqs_over_time=(15, 60)) as bt:
    throttled_requests = bt.multi_submit(reqs)

Both solutions can be implemented without the with statement:
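
To see why a fixed delay and a requests-over-time setting come to roughly the same thing, here is the arithmetic as a small sketch. `delay_for_rate` is a hypothetical helper and assumes requests are spaced evenly; it is not a description of how the library schedules requests internally.

```python
def delay_for_rate(max_requests, period_seconds):
    """Even spacing (in seconds) that keeps max_requests within period_seconds."""
    return float(period_seconds) / max_requests


example_delay = delay_for_rate(15, 60)   # 15 requests per 60 seconds
question_delay = delay_for_rate(8, 15)   # the 8-per-15-seconds limit from the question
```

So 15 requests every 60 seconds corresponds to a delay of 4 seconds, and the limit from the question to a little under 2 seconds between requests.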

import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
bt = BaseThrottler(name='base-throttler', delay=1.5)
bt.start()
throttled_requests = bt.multi_submit(reqs)
bt.shutdown()

For more details, see: http://pythonhosted.org/RequestsThrottler/index.html