Elasticsearch bulk insert with Python - socket timeout error

Stp*_*111 5 python elasticsearch

Elasticsearch 7.10.2

Python 3.8.5

elasticsearch-py 7.12.1

I am trying to bulk insert 100,000 records into Elasticsearch using the elasticsearch-py bulk helpers.

Here is the Python code:

import sys
import datetime
import json
import os
import logging
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

# ES Configuration start
es_hosts = [
    "http://localhost:9200",
]
es_api_user = 'user'
es_api_password = 'pw'
index_name = 'index1'
chunk_size = 10000
errors_before_interrupt = 5
refresh_index_after_insert = False
max_insert_retries = 3
yield_ok = False  # if set to False will skip successful documents in the output

# ES Configuration end
# =======================

filename = 'file.json'

logging.info('Importing data from {}'.format(filename))

es = Elasticsearch(
    es_hosts,
    # http_auth=(es_api_user, es_api_password),
    sniff_on_start=True,  # sniff before doing anything
    sniff_on_connection_fail=True,  # refresh nodes after a node fails to respond
    sniffer_timeout=60,  # and also every 60 seconds
    retry_on_timeout=True,  # should timeout trigger a retry on different node?
)


def data_generator():
    # One JSON document per line; close the file once it is exhausted
    with open(filename) as f:
        for line in f:
            yield {**json.loads(line), "_index": index_name}


errors_count = 0

for ok, result in streaming_bulk(es, data_generator(), chunk_size=chunk_size,
                                 refresh=refresh_index_after_insert,
                                 max_retries=max_insert_retries, yield_ok=yield_ok):
    if ok is not True:
        logging.error('Failed to import data')
        logging.error(str(result))
        errors_count += 1

        if errors_count == errors_before_interrupt:
            logging.fatal('Too many import errors, exiting with error code')
            sys.exit(1)

print("Documents loaded to Elasticsearch")

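This code runs without problems when the json file contains a small number of documents (~100). But I just tested it with a file of 100k documents and got this error: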

WARNING:elasticsearch:POST http://127.0.0.1:9200/_bulk?refresh=false [status:N/A request:10.010s]
Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 1347, in getresponse
    response.begin()
  File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Users/me/opt/anaconda3/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
    response = self.pool.urlopen(
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/util/retry.py", line 386, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=10)  

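I have to admit this one is a bit over my head. I don't usually like pasting large error messages here, but I'm not sure which parts of this one are relevant.

I can't help thinking that I may need to adjust some parameters on the es object, or the configuration variables? I don't know enough about the parameters to make an informed decision on my own.

Last but not least: it looks like some documents were still loaded into the ES index. Stranger still, the count shows 110k even though the json file only has 100k.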

He3*_*xxx 9

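TL;DR:

Reduce chunk_size from 10000 to the default of 500 and I'd expect it to work. You may also want to disable the automatic retries if they give you duplicates.

What's happening?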

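In the streaming_bulk call, you specified chunk_size=10000. This means that streaming_bulk will try to insert chunks of 10000 elements. The connection to elasticsearch has a configurable timeout, which defaults to 10 seconds. So if your elasticsearch server takes more than 10 seconds to process the 10000 elements you want to insert, a timeout occurs and it is handled as an error.

When creating the Elasticsearch object, you also set retry_on_timeout to True, and in the streaming_bulk call you set max_retries=max_insert_retries, which is 3.

This means that when such a timeout happens, the library will retry the request 3 times. If the insert still times out after that, it gives you the error you noticed. (See the documentation.)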

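Also, when the timeout happens, the library cannot know whether the documents were inserted successfully, so it has to assume they were not. It will therefore try to insert the same documents again. I don't know what your input lines look like, but if they do not contain an _id field, this will create duplicates in your index. That would explain a count of 110k for a 100k file. You probably want to prevent this, either by adding some kind of _id or by disabling the automatic retry and handling it manually.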
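One way to add "some kind of _id" is to derive it from the document content, so that a retried chunk overwrites the documents it already wrote instead of duplicating them. The following is a minimal sketch, not part of the original code: the function name actions_with_stable_ids and the choice of a SHA-1 hash over the canonical JSON are assumptions made for illustration.

```python
import hashlib
import json


def actions_with_stable_ids(lines, index_name):
    """Yield bulk actions whose _id is derived from the document content."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        # Hash a canonical serialization so the same document
        # always maps to the same _id, regardless of key order.
        canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
        doc_id = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
        yield {"_index": index_name, "_id": doc_id, "_source": doc}
```

With the file from the question, open(filename) could be passed as lines and the result handed to streaming_bulk in place of data_generator(). Note the trade-off: two lines with identical content collapse into a single document, so if the data already has a natural unique key, using that key as _id is probably the better choice.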

What to do?

There are two ways to address this:

  • Increase the timeout
  • Reduce the chunk_size

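streaming_bulk uses a default chunk_size of 500. Your 10000 is much higher. I would not expect a large performance gain from raising this value above 500, so I'd advise simply using the default of 500 here. If 500 still fails with a timeout, you may even need to reduce it further. That could happen if the documents you want to index are very complex.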
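To make concrete what chunk_size controls, here is a small stand-alone sketch of batching an iterator. It is only an illustration and not the library's actual implementation (the real helper also limits each request by size in bytes); the name chunked is made up for this example.

```python
from itertools import islice


def chunked(iterable, chunk_size=500):
    """Yield lists of at most chunk_size items from iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk


# Each chunk corresponds to one _bulk request, and each request
# has to finish within the timeout.
requests_at_500 = sum(1 for _ in chunked(range(100_000), 500))
requests_at_10000 = sum(1 for _ in chunked(range(100_000), 10_000))
```

For 100,000 documents this gives 200 requests at the default size versus 10 requests at 10000. The smaller chunks mean more round trips, but each one has far less work to finish before the 10 second timeout.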

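You can also increase the timeout, either for the streaming_bulk call or for the es object. To change it only for the streaming_bulk call, you can pass the request_timeout keyword argument: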

for ok, result in streaming_bulk(
        es,
        data_generator(),
        chunk_size=chunk_size,
        refresh=refresh_index_after_insert,
        request_timeout=60*3,  # 3 minutes
        yield_ok=yield_ok):
    # handle like you did
    pass


However, this also means that the failure of an elasticsearch node will only be detected after this higher timeout. See the documentation for more details.
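For the other variant, raising the timeout on the es object itself, a sketch could look like the following. It assumes the 7.x client from the question, where, to my knowledge, timeout and retry_on_timeout are accepted as keyword arguments by the Elasticsearch constructor; other major versions may name these differently. The helper name make_client and the value of 60 seconds are made up for this example.

```python
def make_client(hosts, timeout_seconds=60):
    """Build a client whose requests all use a longer read timeout."""
    # Imported inside the function so the sketch only needs the
    # elasticsearch package when a client is actually created.
    from elasticsearch import Elasticsearch

    return Elasticsearch(
        hosts,
        timeout=timeout_seconds,  # per-request timeout in seconds (default 10)
        retry_on_timeout=False,   # do not resend a request that timed out
    )
```

Calling es = make_client(["http://localhost:9200"]) would then apply the longer timeout to every request, not just the bulk inserts. Turning off retry_on_timeout here is one way to avoid the duplicate inserts described above, at the cost of having to handle a timed-out chunk yourself.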