Semaphore/multiple pool locks in asyncio for 1 proxy - aiohttp

mic*_*al 3 python python-3.x python-asyncio aiohttp

I have 500,000 URLs and want to fetch the response of each one asynchronously.

import aiohttp
import asyncio    

@asyncio.coroutine
def worker(url):
    response = yield from aiohttp.request('GET', url, connector=aiohttp.TCPConnector(share_cookies=True, verify_ssl=False))
    body = yield from response.read_and_close()

    print(url)

def main():
    url_list = [] # lacs of urls, extracting from a file

    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([worker(u) for u in url_list]))

main()

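I want 200 connections at a time (200 concurrent requests) and no more than that, because when I run this program with 50 URLs, i.e. url_list[:50], it works fine, but if I pass the whole list, I get this error: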

aiohttp.errors.ClientOSError: Cannot connect to host www.example.com:443 ssl:True
Future/Task exception was never retrieved
future: Task()

Maybe the request rate is too high and the server refuses to respond once some limit is exceeded?

Uni*_*t03 6

Yes, you can expect a server to stop responding once you send it too much traffic (whatever its definition of "too much traffic" is).

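One way to limit the number of concurrent requests (to throttle them) in such cases is to use asyncio.Semaphore, which works much like the semaphores used in multithreading: you create a semaphore and make sure the operation you want to throttle acquires it before doing the actual work and releases it afterwards.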

For convenience, asyncio.Semaphore implements the context manager protocol, which makes this even easier.
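To make the context manager point concrete, here is a minimal sketch written in the newer async/await syntax rather than the generator-based style used elsewhere in this thread. It only illustrates that `async with semaphore` does the same job as an explicit `acquire()` followed by `release()` in a `finally` block. The function names and the `log` list are hypothetical, chosen just for this illustration.

```python
import asyncio


async def with_context_manager(semaphore, log):
    # Acquired on entry, released on exit, even if the body raises.
    async with semaphore:
        log.append("working")
        await asyncio.sleep(0)


async def with_explicit_calls(semaphore, log):
    # The same steps spelled out by hand.
    await semaphore.acquire()
    try:
        log.append("working")
        await asyncio.sleep(0)
    finally:
        semaphore.release()


async def demo():
    # A limit of 1 forces the two coroutines to take turns.
    semaphore = asyncio.Semaphore(1)
    log = []
    await asyncio.gather(
        with_context_manager(semaphore, log),
        with_explicit_calls(semaphore, log),
    )
    # Both coroutines have finished, so the semaphore should be free again.
    return log, semaphore.locked()


log, still_locked = asyncio.run(demo())
```

The context manager form is usually preferable because the release cannot be forgotten on an error path.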

The most basic approach:

import aiohttp
import asyncio

CONCURRENT_REQUESTS = 200


@asyncio.coroutine
def worker(url, semaphore):
    # Acquiring/releasing the semaphore using a context manager.
    with (yield from semaphore):
        response = yield from aiohttp.request(
            'GET',
            url,
            connector=aiohttp.TCPConnector(share_cookies=True,
                                           verify_ssl=False))
        body = yield from response.read_and_close()

        print(url)


def main():
    url_list = [] # lacs of urls, extracting from a file

    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([worker(u, semaphore) for u in url_list]))    
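The code above uses the generator-based coroutine style (`@asyncio.coroutine`, `with (yield from semaphore)`), which recent Python releases no longer support. As a rough sketch of the same throttling pattern in async/await syntax, the example below replaces the real HTTP call with a hypothetical `fake_fetch` that just sleeps, so it runs without a network or aiohttp. The `stats` dictionary is also an assumption, added only to observe how many requests are in flight at once. Passing `return_exceptions=True` to `asyncio.gather` is an addition beyond the answer: it returns failures in the result list rather than leaving them as unretrieved task exceptions.

```python
import asyncio

CONCURRENT_REQUESTS = 200


async def fake_fetch(url, stats):
    # Hypothetical stand-in for the real HTTP request.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.001)  # simulate network latency
    stats["active"] -= 1
    return url


async def worker(url, semaphore, stats):
    # At most CONCURRENT_REQUESTS workers get past this line at a time.
    async with semaphore:
        return await fake_fetch(url, stats)


async def main(url_list, limit=CONCURRENT_REQUESTS):
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    # gather keeps results in input order; exceptions come back as values.
    results = await asyncio.gather(
        *(worker(u, semaphore, stats) for u in url_list),
        return_exceptions=True,
    )
    return results, stats["peak"]


urls = ["http://www.example.com/%d" % i for i in range(1000)]
results, peak = asyncio.run(main(urls))
print(len(results), peak)
```

To use this against real servers, the body of `fake_fetch` would be swapped for an actual request; everything else in the pattern stays the same.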