How can I improve parallel_bulk insert performance into Elasticsearch from Python?

azr*_*zro 6 python performance elasticsearch

I have some documents (about 300 bytes each) that I'd like to insert into my ES index using the Python library. There is a huge time difference between the Python code and curl; obviously some gap is normal, but I'd like to know whether the ratio can be improved.

  1. The curl option takes about 20 s to insert, plus roughly 10 s more to print the ES result (the data is already inserted after the 20 s):

    curl -H "Content-Type: application/json" -XPOST \
         "localhost:9200/contentindex/doc/_bulk?" --data-binary @superfile.bulk.json
    
  2. With the Python option, the best I reached was 1 min 20 s, using the settings 10000/16/16 (chunk_size/thread_count/queue_size):

    import codecs
    from collections import deque
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk
    
    es = Elasticsearch()
    
    def insert_data(filename, indexname):
        with codecs.open(filename, "r", encoding="utf-8", errors="ignore") as fic:
            for line in fic:        
                json_line = {}
                json_line["data1"] = "random_foo_bar1"
                json_line["data2"] = "random_foo_bar2"
                # more fields ...        
                yield {
                    "_index": indexname,
                    "_type": "doc",
                    "_source": json_line
                }
    
    if __name__ == '__main__':
        pb = parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"),
                           chunk_size=10000, thread_count=16, queue_size=16)
        deque(pb, maxlen=0)  # consume the generator so the bulk requests are actually executed
    
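As an aside, parallel_bulk yields (success, info) tuples, so instead of discarding everything with deque I can also iterate over the results to spot per-document failures — a minimal sketch reusing the generator above:

    failed = 0
    for ok, item in parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"),
                                  chunk_size=10000, thread_count=16, queue_size=16):
        if not ok:
            failed += 1  # item contains the error that ES returned for that document
    print("failed docs:", failed)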

Facts

  • I have a machine with two 8-core Xeon processors and 64 GB of RAM
  • I tried multiple values for each parameter: [100-50000] / [2-24] / [2-24]

Questions

  • Can I still improve the time?

  • If not, should I write the data to a file and then call the curl command from a subprocess (see the sketch below)?
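Something like this is what I have in mind — a hypothetical sketch that just shells out to the same curl command used above:

    import subprocess

    # Hypothetical: the bulk file is assumed to have been written beforehand.
    subprocess.run([
        "curl", "-H", "Content-Type: application/json", "-XPOST",
        "localhost:9200/contentindex/doc/_bulk",
        "--data-binary", "@superfile.bulk.json",
    ], check=True)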


If I try only the parsing part, it takes 15 seconds:

import time

tm = time.time()
array = []

pb = insert_data("superfile.bulk.json", "contentindex")
for p in pb:
    array.append(p)
print(time.time() - tm)            # 15  (parsing only)

pb = parallel_bulk(es, array, chunk_size=10000, thread_count=16, queue_size=16)
deque(pb, maxlen=0)
print(time.time() - tm)            # 90  (parsing + indexing)

ozl*_*vka 8

After my tests:

  1. curl works faster than the Python client; apparently curl is better implemented.

  2. After more tests and playing with the parameters, I can conclude:

    1. Elasticsearch indexing performance depends on the configuration of the index and of the whole cluster. You can improve performance by mapping the fields of the index correctly (see the sketch after this list).
    2. My best result was with 8 threads and chunks of 10000 items. This depends on the index.index_concurrency setting, which is 8 by default.

    3. I think that using a multi-node cluster with separate master nodes should improve performance.

    4. For more information, you can read a great two-part article I found: here and here
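
To make points 1 and 2 concrete, here is a minimal sketch of what I mean (the field names and types are assumptions based on the question, not a real mapping, and insert_data is the generator from the question):

    from collections import deque
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch()

    # 1. Map the fields explicitly instead of relying on dynamic mapping.
    es.indices.create(index="contentindex", body={
        "mappings": {
            "doc": {
                "properties": {
                    "data1": {"type": "keyword"},   # assumed field types
                    "data2": {"type": "keyword"}
                }
            }
        }
    })

    # 2. 8 threads and chunks of 10000 items gave the best result
    #    (matching the index.index_concurrency default of 8).
    pb = parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"),
                       chunk_size=10000, thread_count=8)
    deque(pb, maxlen=0)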