python elasticsearch bulk index datatype

Nar*_* MG 3 python elasticsearch

I am using the following code to create an index and load data into Elasticsearch:

from elasticsearch import helpers, Elasticsearch
import csv
es = Elasticsearch('localhost:9200')
index_name='wordcloud_data'
with open('./csv-data/' + index_name +'.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index=index_name, doc_type='my-type')

print ("done")

My CSV data is as follows

date,word_data,word_count
2017-06-17,luxury vehicle,11
2017-06-17,signifies acceptance,17
2017-06-17,agency imposed,16
2017-06-17,customer appreciation,11

The data loads fine, but the data types are not accurate. How do I force word_count to be mapped as an integer rather than text? Note that Elasticsearch figures out the date type on its own. Is there a way for it to detect the integer type automatically, or can I pass some parameter?

Also, how do I increase ignore_above, or remove it, for some of the fields if I want to, so that there is basically no limit on the number of characters? This is the mapping I currently get:

{
  "wordcloud_data" : {
    "mappings" : {
      "my-type" : {
        "properties" : {
          "date" : {
            "type" : "date"
          },
          "word_count" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "word_data" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

drd*_*man 5

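You need to create a mapping that describes the field types.

With the elasticsearch-py client, this can be done using the es.indices.put_mapping or es.indices.create methods, by passing a JSON document that describes the mapping, as shown in this SO answer. It would look something like this: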

es.indices.put_mapping(
    index="wordcloud_data",
    doc_type="my-type",
    body={
        "properties": {  
            "date": {"type":"date"},
            "word_data": {"type": "text"},
            "word_count": {"type": "integer"}
        }
    }
)
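One caveat worth adding: Elasticsearch does not let you change the type of a field that already exists in a mapping, so on an index that has already been loaded with word_count as text, the mapping most likely has to be supplied when the index is (re)created, before the bulk load. The sketch below builds such a creation body and also covers the ignore_above part of the question. The helper name `build_index_body` and its `keyword_limit` parameter are made up for illustration, and the "my-type" wrapper assumes an older server that still uses mapping types.

```python
import json

INDEX_NAME = "wordcloud_data"
DOC_TYPE = "my-type"  # assumption: a server version that still has mapping types


def build_index_body(keyword_limit=None):
    """Return an index-creation body with explicit field types.

    keyword_limit is a hypothetical knob: None leaves `ignore_above`
    off the keyword sub-field, an int sets it explicitly.
    """
    keyword = {"type": "keyword"}
    if keyword_limit is not None:
        keyword["ignore_above"] = keyword_limit
    properties = {
        "date": {"type": "date"},
        "word_count": {"type": "integer"},
        "word_data": {"type": "text", "fields": {"keyword": keyword}},
    }
    return {"mappings": {DOC_TYPE: {"properties": properties}}}


body = build_index_body(keyword_limit=1024)
print(json.dumps(body, indent=2, sort_keys=True))

# Against a live cluster (not run here), something like:
#   es.indices.delete(index=INDEX_NAME, ignore=[404])
#   es.indices.create(index=INDEX_NAME, body=body)
# followed by the helpers.bulk(...) call from the question.
```

The 256 limit in the question's mapping is only what dynamic mapping applies by default. A keyword field declared without ignore_above has no such cutoff (very long values can still hit Lucene's term-length limit), and a plain text field with no keyword sub-field has no ignore_above at all. Elasticsearch also documents a numeric_detection mapping option that makes dynamic mapping treat numeric-looking strings as numbers, which may be worth checking for your version.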

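However, I would suggest taking a look at the elasticsearch-dsl package, which provides a much nicer declarative API for describing things. It would be something along these lines (untested):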

import csv

# Note: DocType was renamed Document in later elasticsearch-dsl releases.
from elasticsearch_dsl import DocType, Date, Integer, Text
from elasticsearch_dsl.connections import connections
from elasticsearch.helpers import bulk

connections.create_connection(hosts=["localhost"])

index_name = "wordcloud_data"  # also used to locate the CSV file below

class WordCloud(DocType):
    word_data = Text()
    word_count = Integer()
    date = Date()

    class Index:
        name = "wordcloud_data"
        doc_type = "my-type"   # If you need it to be called so

# Create the index and its mapping before loading any documents
WordCloud.init()
with open("./csv-data/%s.csv" % index_name) as f:
    reader = csv.DictReader(f)
    bulk(
        connections.get_connection(),
        (WordCloud(**row).to_dict(True) for row in reader)
    )

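Note that I haven't tried the code I posted; I just wrote it. I don't have an ES server at hand to test it. There may be some minor mistakes or typos (please point them out if you find any), but the general idea should be correct.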
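As a side note on why the date was detected but the count was not: csv.DictReader yields every value as a string, and dynamic mapping applies date detection to strings but maps other strings to text. Converting the value in Python before indexing is one way around that, since a JSON number is mapped to a numeric type. The following is a minimal sketch that runs without a server; the `typed_actions` helper and the inline sample are illustrative assumptions, not part of either library.

```python
import csv
import io

# Two rows copied from the question's CSV, inlined so the sketch is self-contained
SAMPLE_CSV = """date,word_data,word_count
2017-06-17,luxury vehicle,11
2017-06-17,signifies acceptance,17
"""


def typed_actions(rows, index_name, doc_type="my-type"):
    """Yield one bulk action per CSV row, with word_count converted to int."""
    for row in rows:
        source = dict(row)
        source["word_count"] = int(source["word_count"])
        # "_type" is only meaningful on older servers that use mapping types
        yield {"_index": index_name, "_type": doc_type, "_source": source}


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
actions = list(typed_actions(reader, "wordcloud_data"))
print(actions[0])

# With a live cluster (not run here), the generator could be passed directly:
#   helpers.bulk(es, typed_actions(reader, "wordcloud_data"))
```

This only affects a field that has no mapping yet. If word_count was already mapped as text by an earlier load, the index still needs to be recreated.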