How to solve a urllib3.connectionpool issue in Selenium when automating in parallel?

xba*_*laj 6 python selenium multithreading urllib selenium-webdriver

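Quick description

I was using Selenium to process many pages sequentially, but to improve performance I decided to parallelize the processing by splitting the pages across several threads (this is possible because the pages are independent of each other).

Here is the simplified code: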

import threading
from concurrent.futures import ThreadPoolExecutor

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


def process_page(driver, page, lock):
    driver.get(page.url())
    driver.find_element_by_css_selector("some selector")
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "some selector")))
    .
    .
    .
    # Hold the lock so one page's results are logged together.
    with lock:
        for item in result_tuple:
            logger.info(item)
    return result_tuple


def process_all_pages():
    def pages_processing(id, lock):
        # Each worker owns one driver and handles its own slice of 50 pages.
        result = []
        with MyWebDriver(webdriver_options) as driver:
            for i in range(50):
                result.append(process_page(driver, pages[id * 50 + i], lock))
        return result

    lock = threading.Lock()

    with ThreadPoolExecutor(4) as executor:
        futures = [executor.submit(pages_processing, i, lock) for i in range(4)]
        # Collect the results in submission order.
        result = [future.result() for future in futures]

    return result

MyWebDriver is just a simple context manager for the Chrome driver: entering the context spawns a new Chrome driver instance, and exiting the context quits that Chrome instance.
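The question does not show MyWebDriver itself, so the following is only a minimal sketch of what such a context manager might look like. The names `managed_driver`, `create_driver` and `FakeDriver` are hypothetical; `FakeDriver` stands in for a real Chrome driver so the sketch runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Create a driver on entry and always quit it on exit."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # quit() runs even if the body raised, so no browser is left behind.
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a real WebDriver (assumption, not Selenium)."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as driver:
    # The driver is live only inside the with-block.
    assert not driver.closed
```

With Selenium, `create_driver` could presumably be something like `lambda: webdriver.Chrome(options=webdriver_options)`, so that each thread creates and owns its own instance.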

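This code spawns 4 Chrome drivers, one per thread, and performs some Selenium work in each driver, with every thread working separately.

Problem

For the first few seconds it works like a charm, but after a while warnings start to appear in the logger and Selenium seems to stop communicating with the Chrome drivers.

  • The same behavior occurs with any number of threads; it only works when run on a single thread.
  • The behavior is the same on Windows and on Ubuntu.

I can also provide debug logs if needed, but I am not sure there is anything relevant in them.

Warnings in the logger: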
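For anyone who wants to collect such debug logs, here is a minimal sketch using only the standard `logging` module. The logger name `urllib3.connectionpool` is taken from the warnings in this question. The function name and the log format (which adds the thread name, so messages can be matched to workers) are assumptions.

```python
import logging


def enable_pool_debug_logging():
    """Send DEBUG output of urllib3's connection pool logger to stderr."""
    handler = logging.StreamHandler()
    # threadName shows which worker thread produced each message.
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(threadName)s %(levelname)s - %(name)s - %(message)s"
    ))
    pool_logger = logging.getLogger("urllib3.connectionpool")
    pool_logger.setLevel(logging.DEBUG)
    pool_logger.addHandler(handler)
    return pool_logger


pool_logger = enable_pool_debug_logging()
```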

...
# With these first warnings Selenium stops communicating with some Chrome drivers - nothing happens in some of them.
WARNING - urllib3.connectionpool - Connection pool is full, discarding connection: 127.0.0.1
WARNING - urllib3.connectionpool - Connection pool is full, discarding connection: 127.0.0.1
...
# These warnings come a bit later
WARNING - urllib3.connectionpool - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018343AB24A8>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
WARNING - urllib3.connectionpool - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018348854E10>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
WARNING - urllib3.connectionpool - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018348869710>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
...

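Tried workarounds

I tried these patches to set a higher maxsize (HTTPConnectionPool, HTTPSConnectionPool) - /sf/answers/1557755951/ - which, by the way, did not solve the problem. The patch was executed.

Next, I tried setting a higher num_pools in the PoolManager class. I changed this directly in the source, and also changed maxsize in HTTPConnectionPool and HTTPSConnectionPool. This actually solved one problem: the warnings no longer appear in the logs, but Selenium's communication with the drivers still freezes.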
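A toy model may help explain why a larger maxsize can silence the first warning without changing the underlying behavior. As far as I understand it, a urllib3 pool keeps at most maxsize idle connections, and a connection returned to an already full pool is discarded, which is when the "Connection pool is full" warning is logged. The sketch below is not urllib3 code; `ToyPool` and `simulate` are hypothetical names, and plain objects stand in for connections.

```python
import queue


class ToyPool:
    """Toy model of a bounded pool of idle connections (not urllib3 itself)."""

    def __init__(self, maxsize):
        self.idle = queue.LifoQueue(maxsize)
        self.discarded = 0

    def get(self):
        try:
            return self.idle.get(block=False)
        except queue.Empty:
            # Nothing idle: open a fresh "connection" instead of waiting.
            return object()

    def put(self, conn):
        try:
            self.idle.put(conn, block=False)
        except queue.Full:
            # This is the point where the real warning would be logged.
            self.discarded += 1


def simulate(maxsize, concurrent_users):
    """Check out one connection per user at once, then return them all."""
    pool = ToyPool(maxsize)
    held = [pool.get() for _ in range(concurrent_users)]
    for conn in held:
        pool.put(conn)
    return pool.discarded
```

In this model, `simulate(1, 4)` discards three connections and `simulate(4, 4)` discards none. That suggests the warning appears when a single pool serves more concurrent users than maxsize. It says nothing about why the drivers later refuse connections.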