Deleting duplicate records in Elasticsearch

Sap*_*lio 6 elasticsearch elasticsearch-query

I have millions of records in Elasticsearch. Today I noticed that some of them are duplicated. Is there any way to delete these duplicate records?

This is my query:

{
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "must": [
                        {"match": {"sensorId": "14FA084408"}},
                        {"match": {"variableName": "FORWARD_FLOW"}}
                    ]
                }
            },
            "filter": {
                "range": {
                    "timestamp": {
                        "gt": "2015-07-04",
                        "lt": "2015-07-06"
                    }
                }
            }
        }
    }
}

And this is what I get back:

{
"took": 2,
"timed_out": false,
"_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
},
"hits": {
    "total": 21,
    "max_score": 8.272615,
    "hits": [
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxVcMpd7AZtvmZcK",
            "_score": 8.272615,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxVnMpd7AZtvmZcL",
            "_score": 8.272615,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxV6Mpd7AZtvmZcN",
            "_score": 8.0957,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxWOMpd7AZtvmZcP",
            "_score": 8.0957,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxW8Mpd7AZtvmZcT",
            "_score": 8.0957,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxXFMpd7AZtvmZcU",
            "_score": 8.0957,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxXbMpd7AZtvmZcW",
            "_score": 8.0957,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxUtMpd7AZtvmZcG",
            "_score": 8.077545,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxXPMpd7AZtvmZcV",
            "_score": 8.077545,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxUZMpd7AZtvmZcE",
            "_score": 7.9553676,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        }
    ]
    }
}

As you can see, I have 21 duplicated records for the same day. How can I delete the duplicates and keep only one record per day? Thanks.

Vam*_*hna 4

Get a count first (use the Count API for that), then use delete-by-query with a size one less than the count (combine the Delete By Query API with the from/size parameters to get this):

Count API

From/Size API

Delete By Query API

In that case you should write the query so that it matches only the duplicated records.
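The "match only the duplicated records" step above can be sketched as a small query builder that mirrors the filtered-query DSL already used in the question (Elasticsearch 1.x syntax). The function name and parameters are illustrative, not part of any Elasticsearch client API:

```python
def build_dedup_query(sensor_id, variable_name, day_start, day_end):
    """Build an ES 1.x filtered query matching one sensor/variable/day window.

    The resulting dict can be sent as the body of a search, count, or
    delete-by-query request to narrow the operation to the duplicated set.
    """
    return {
        "query": {
            "filtered": {
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"sensorId": sensor_id}},
                            {"match": {"variableName": variable_name}},
                        ]
                    }
                },
                "filter": {
                    "range": {
                        "timestamp": {"gt": day_start, "lt": day_end}
                    }
                },
            }
        }
    }

# Same parameters as the query in the question:
query = build_dedup_query("14FA084408", "FORWARD_FLOW", "2015-07-04", "2015-07-06")
```

Building the body once and reusing it for both the count and the delete keeps the two requests guaranteed to target the same document set.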

Or just query the ids and issue a bulk delete for all but one of them. However, I guess you can't do that, since your documents have no id of their own (the `id` field in `_source` is null). IMHO I don't see any other clever way to do this.
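Note that even with `id` null in `_source`, every hit still carries an Elasticsearch `_id`, so the bulk-delete route is workable: group the hits by the fields that define a logical duplicate and collect the `_id` of every copy after the first. A minimal sketch (the helper names and the choice of grouping key are assumptions based on the documents shown above):

```python
import json
from collections import defaultdict

def find_duplicate_ids(hits):
    """Group hits by (sensorId, variableName, timestamp) and return the
    _ids of every copy after the first, i.e. the documents to delete."""
    groups = defaultdict(list)
    for hit in hits:
        src = hit["_source"]
        key = (src["sensorId"], src["variableName"], src["timestamp"])
        groups[key].append(hit["_id"])
    to_delete = []
    for ids in groups.values():
        to_delete.extend(ids[1:])  # keep exactly one copy per key
    return to_delete

def bulk_delete_body(index, doc_type, ids):
    """Build the NDJSON body for a Bulk API request deleting the given ids."""
    lines = [
        json.dumps({"delete": {"_index": index, "_type": doc_type, "_id": i}})
        for i in ids
    ]
    return "\n".join(lines) + "\n"

# Two hits shaped like the response above; the second is a duplicate.
hits = [
    {"_id": "A1", "_source": {"sensorId": "14FA084408",
                              "variableName": "FORWARD_FLOW",
                              "timestamp": 1436047200000}},
    {"_id": "A2", "_source": {"sensorId": "14FA084408",
                              "variableName": "FORWARD_FLOW",
                              "timestamp": 1436047200000}},
]
dupes = find_duplicate_ids(hits)          # ["A2"]
body = bulk_delete_body("iotsens-summarizedmeasures", "summarizedmeasure", dupes)
```

The resulting body would be POSTed to the `_bulk` endpoint. For millions of records you would page through the index (scan/scroll) and flush the bulk body in batches rather than holding all ids in memory.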

  • @Sapikelio Could you post the script? I have the same problem, but with millions of records, and I'm trying to find the most scalable way to do it. (3 upvotes)