I have some documents (about 300 bytes each) that I'd like to insert into my ES index using the Python library, and I see a huge time difference between that code and curl. Some gap is obviously normal, but I'd like to know whether the ratio can be improved.
With the curl option, the whole run takes about 20 seconds, roughly 10 of which go to printing the ES response (the data is fully inserted once the 20 seconds are up):
curl -H "Content-Type: application/json" -XPOST
"localhost:9200/contentindex/doc/_bulk?" --data-binary @superfile.bulk.json
With the Python option, the best I reached was 1 min 20 s, using the settings 10000/16/16 (chunk_size/thread_count/queue_size):
import codecs
from collections import deque
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch()

def insert_data(filename, indexname):
    with codecs.open(filename, "r", encoding="utf-8", errors="ignore") as fic:
        for line in fic:
            json_line = {}
            json_line["data1"] = "random_foo_bar1"
            json_line["data2"] = "random_foo_bar2"
            # more fields ...
            yield {
                "_index": indexname,
                "_type": "doc",
                "_source": json_line
            }

if __name__ == '__main__':
    pb = parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"),
                       chunk_size=10000, thread_count=16, queue_size=16)
    # parallel_bulk is lazy: drain the generator so the requests actually run
    deque(pb, maxlen=0)
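As a side note, parallel_bulk yields an (ok, info) tuple per action, so a minimal variant of the main block (my own sketch, not something the post ran) could iterate the results instead of draining them with deque; silent per-document failures and retries can inflate timings:

# Sketch: same call as above, but inspect each result tuple
for ok, info in parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"),
                              chunk_size=10000, thread_count=16, queue_size=16):
    if not ok:
        print(info)  # failed actions would otherwise go unnoticed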
Facts
I tried chunk_size/thread_count/queue_size values across the ranges [100-50000]/[2-24]/[2-24].
Questions
Can I still improve the time?
If not, should I instead write the data to a file and then run the curl command from a separate process (see the sketch below)?
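For reference, a minimal sketch of that fallback, assuming superfile.bulk.json is already in bulk format on disk (the subprocess call is my illustration, not something I have measured):

import subprocess

# Shell out to the same curl command from Python; -s suppresses the progress bar
subprocess.run([
    "curl", "-s", "-H", "Content-Type: application/json",
    "-XPOST", "localhost:9200/contentindex/doc/_bulk",
    "--data-binary", "@superfile.bulk.json",
], check=True)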
If I try only the parsing part, it takes 15 seconds:
import time

tm = time.time()

array = []
pb = insert_data("superfile.bulk.json", "contentindex")
for p in pb:
    array.append(p)
print(time.time() - tm)  # 15: parsing alone

pb = parallel_bulk(es, array, chunk_size=10000, thread_count=16, queue_size=16)
deque(pb, maxlen=0)
print(time.time() - tm)  # 90: parsing + bulk
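So the bulk phase alone accounts for roughly 75 of the 90 seconds. To tell whether that time comes from parallel_bulk's threading or from the client itself, a single-threaded baseline could be timed with the plain bulk helper (a sketch under my assumptions, not a measurement from the run above):

from elasticsearch.helpers import bulk

tm = time.time()
# helpers.bulk sends the same actions on one thread and returns
# (number_of_successes, list_of_errors)
success, errors = bulk(es, array, chunk_size=10000)
print(success, errors, time.time() - tm)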