HEAD requests with aiohttp are very slow

Question

tso*_*orn 5 python python-3.x python-asyncio aiohttp

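Given a list of 50k website URLs, my task is to find out which of them are up/reachable. The idea is simply to send a HEAD request to each URL and look at the status of the response. From what I hear, an asynchronous approach is the way to go, and for now I am using asyncio with aiohttp.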

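I came up with the following code, but the speed is very poor: 1000 URLs take about 200 seconds on my 10 Mbit connection. I don't know what speed to expect, but I am new to asynchronous programming in Python, so I suspect I went wrong somewhere. As you can see, I have tried increasing the number of allowed simultaneous connections to 1000 (up from the default of 100) and the duration for which DNS resolutions are kept in the cache; neither had much effect. The environment has Python 3.6 and aiohttp 3.5.4.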

Code review unrelated to the question is also appreciated.

import asyncio
import time
from socket import gaierror
from typing import List, Tuple

import aiohttp
from aiohttp.client_exceptions import TooManyRedirects

# Using a non-default user-agent seems to avoid lots of 403 (Forbidden) errors
HEADERS = {
    'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/45.0.2454.101 Safari/537.36'),
}


async def get_status_code(session: aiohttp.ClientSession, url: str) -> Tuple[int, str]:
    try:
        # A HEAD request is quicker than a GET request
        resp = await session.head(url, allow_redirects=True, ssl=False, headers=HEADERS)
        async with resp:
            status = resp.status
            reason = resp.reason
        if status == 405:
            # HEAD request not allowed, fall back on GET
            resp = await session.get(
                url, allow_redirects=True, ssl=False, headers=HEADERS)
            async with resp:
                status = resp.status
                reason = resp.reason
        return (status, reason)
    except aiohttp.InvalidURL as e:
        return (900, str(e))
    except aiohttp.ClientConnectorError:
        return (901, "Unreachable")
    except gaierror as e:
        return (902, str(e))
    except aiohttp.ServerDisconnectedError as e:
        return (903, str(e))
    except aiohttp.ClientOSError as e:
        return (904, str(e))
    except TooManyRedirects as e:
        return (905, str(e))
    except aiohttp.ClientResponseError as e:
        return (906, str(e))
    except aiohttp.ServerTimeoutError:
        return (907, "Connection timeout")
    except asyncio.TimeoutError:
        return (908, "Connection timeout")


async def get_status_codes(loop: asyncio.events.AbstractEventLoop, urls: List[str],
                           timeout: int) -> List[Tuple[int, str]]:
    conn = aiohttp.TCPConnector(limit=1000, ttl_dns_cache=300)
    client_timeout = aiohttp.ClientTimeout(connect=timeout)
    async with aiohttp.ClientSession(
            loop=loop, timeout=client_timeout, connector=conn) as session:
        codes = await asyncio.gather(*(get_status_code(session, url) for url in urls))
        return codes


def poll_urls(urls: List[str], timeout=20) -> List[Tuple[int, str]]:
    """
    :param timeout: in seconds
    """
    print("Started polling")
    time1 = time.time()
    loop = asyncio.get_event_loop()
    codes = loop.run_until_complete(get_status_codes(loop, urls, timeout))
    time2 = time.time()
    dt = time2 - time1
    print(f"Polled {len(urls)} websites in {dt:.1f} seconds "
          f"at {len(urls)/dt:.3f} URLs/sec")
    return codes


Answer 1

Mik*_*mov 5

Right now you are launching all the requests at once, so a bottleneck probably appears somewhere. To avoid this, you can use a semaphore:

# code

sem = asyncio.Semaphore(200)


async def get_status_code(session: aiohttp.ClientSession, url: str) -> Tuple[int, str]:
    try:
        async with sem:
            resp = await session.head(url, allow_redirects=True, ssl=False, headers=HEADERS)
            # code

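As a self-contained illustration of the same pattern, here is a minimal sketch that caps concurrency with a semaphore. The names `poll_bounded` and `check` are hypothetical and not part of the original script; `check` stands for any coroutine function that takes a URL and returns a `(status, reason)` tuple. The semaphore is created inside the coroutine so that it belongs to the running event loop.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

Result = Tuple[int, str]


async def poll_bounded(
    urls: List[str],
    check: Callable[[str], Awaitable[Result]],  # hypothetical per-URL coroutine
    limit: int = 200,
) -> List[Result]:
    """Run check(url) for every URL, with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(url: str) -> Result:
        # Tasks beyond the limit wait here until a slot is released.
        async with sem:
            return await check(url)

    # gather() returns the results in the same order as `urls`.
    return await asyncio.gather(*(guarded(url) for url in urls))
```

With the code from the question, this could be called inside the `async with aiohttp.ClientSession(...)` block as `await poll_bounded(urls, lambda u: get_status_code(session, u), limit=200)`.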

I tested it the following way:

poll_urls([
    'http://httpbin.org/delay/1' 
    for _ 
    in range(2000)
])


And got:

Started polling
Polled 2000 websites in 13.2 seconds at 151.300 URLs/sec


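Although it requests a single host, it shows that the asynchronous approach does the job: 13 seconds, versus roughly 2000 seconds for 2000 one-second requests made one after another.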
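A rough estimate puts the measured 13.2 seconds in context. The sketch below assumes that every request takes about 1 second (the httpbin delay) and that requests complete in lockstep waves of at most 200 (the semaphore value), ignoring network latency and connection setup; `ideal_seconds` is a hypothetical helper, not part of the script.

```python
import math


def ideal_seconds(n_requests: int, concurrency: int, seconds_per_request: float) -> float:
    """Lower-bound estimate: requests are processed in full waves of `concurrency`."""
    waves = math.ceil(n_requests / concurrency)
    return waves * seconds_per_request


sequential = ideal_seconds(2000, 1, 1.0)   # one request at a time
bounded = ideal_seconds(2000, 200, 1.0)    # 200 requests in flight

print(f"sequential: {sequential:.0f} s, semaphore of 200: {bounded:.0f} s")
```

Under these assumptions the bound is 10 seconds for a semaphore of 200, so the measured 13.2 seconds leaves only a few seconds of overhead.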

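A few more things you can do:

You should experiment with the semaphore value to get better performance for your specific environment and task.
Try lowering the timeout from 20 to, say, 5 seconds: since you are only doing HEAD requests, they shouldn't take much time. If a request hangs for 5 seconds, chances are it won't succeed at all.
Monitoring system resources (network/CPU/RAM) while the script is running can help determine whether a bottleneck is still present.
By the way, do you have aiodns installed (as the docs suggest)?
Does disabling SSL change anything?
Try enabling debug-level logging to see whether there is any useful information there.
Try setting up client tracing, and in particular measuring the time of each request step, to see which steps take the most time.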

It is hard to say more without a fully reproducible case.
