Failed to start shard in ElasticSearch: IndexShardGatewayRecoveryException, "sending failed shard"

Rep*_*pox 9 sharding elasticsearch

I am getting the following error in my ES logs. I am running three nodes.

Caused by: java.lang.ArrayIndexOutOfBoundsException
[2014-09-08 13:53:56,167][WARN ][cluster.action.shard     ] [Dancing Destroyer] [events][3] sending failed shard for [events][3], node[RDZy21y7SRep7n6oWT8ogg], [P], s[INITIALIZING], indexUUID [gzj1aHTnQX6XDc0SxkvxDQ], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[events][3] failed recovery]; nested: FlushFailedEngineException[[events][3] Flush failed]; nested: ArrayIndexOutOfBoundsException; ]]
[2014-09-08 13:53:56,357][WARN ][indices.cluster          ] [Dancing Destroyer] [events][3] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [events][3] failed recovery
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.index.engine.FlushFailedEngineException: [events][3] Flush failed
        at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:805)
        at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryFinalization(InternalIndexShard.java:726)
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:249)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
        ... 3 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
[2014-09-08 13:53:56,381][WARN ][cluster.action.shard     ] [Dancing Destroyer] [events][3] sending failed shard for [events][3], node[RDZy21y7SRep7n6oWT8ogg], [P], s[INITIALIZING], indexUUID [gzj1aHTnQX6XDc0SxkvxDQ], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[events][3] failed recovery]; nested: FlushFailedEngineException[[events][3] Flush failed]; nested: ArrayIndexOutOfBoundsException; ]]

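As a result, the ES cluster status is red and I am missing nearly 10 million documents. What does this error mean, and how can I recover from it?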

Rep*_*pox 12

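It looks like I had a corrupted shard that needed repairing. This is a Lucene matter: you tell Lucene to fix the shard.

On Ubuntu, the solution is to go to the /usr/share/elasticsearch/lib directory, find out which version of Lucene core is in use (running ls will show a file named something like lucene-core-4.8.1.jar), and then type: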

java -cp lucene-core-x.x.x.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/<clustername>/nodes/0/indices/<index>/<shard>/index/ -fix

Replace x.x.x with the Lucene core version, and fill in your own cluster name, index name and, of course, the number of the failed shard.
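As a rough sketch of the step above, the following Python helper looks up the lucene-core jar and assembles the CheckIndex command line without running it. The function name `build_checkindex_command` and its parameters are made up for illustration and are not part of Elasticsearch or Lucene; the directory layout is simply the one used in the command above. With the default `fix=False` the `-fix` flag is left out, in which case CheckIndex only reports problems, so that is probably worth running first.

```python
from pathlib import Path


def build_checkindex_command(lib_dir, data_dir, cluster, index, shard, fix=False):
    """Assemble the Lucene CheckIndex command for one shard (does not run it)."""
    # Expect exactly one lucene-core-<version>.jar in the Elasticsearch lib directory.
    jars = sorted(Path(lib_dir).glob("lucene-core-*.jar"))
    if len(jars) != 1:
        raise RuntimeError(
            f"expected one lucene-core jar in {lib_dir}, found {len(jars)}"
        )

    # Layout assumed from the command above:
    # <data_dir>/<cluster>/nodes/0/indices/<index>/<shard>/index
    shard_index_dir = (
        Path(data_dir) / cluster / "nodes" / "0" / "indices"
        / index / str(shard) / "index"
    )

    cmd = [
        "java",
        "-cp", str(jars[0]),
        "-ea:org.apache.lucene...",  # enable assertions in Lucene packages
        "org.apache.lucene.index.CheckIndex",
        str(shard_index_dir),
    ]
    if fix:
        # -fix drops unreadable segments, which can lose documents.
        cmd.append("-fix")
    return cmd
```

The returned list can be printed for review or passed to `subprocess.run(cmd, check=True)`. It is generally safest to do this while Elasticsearch is stopped on that node.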

This may cause documents to be lost,

but it solved our problem.


kar*_*k r 6

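I have run into this problem many times. Since my setup ingests clickstream data (12-20M hits per day), I cannot afford to lose data.

So here is my solution, and it works beautifully:

Solution: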

  1. Stop Elasticsearch
  2. Go to /path/to/my/data/mycluster_name/nodes/0/indices/myindex_name/index
  3. Delete the segments.gen file
  4. Start Elasticsearch
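
The file-level part of these steps (2 and 3) could be sketched in Python as follows. This is only an illustration: the function name `sideline_segments_gen` is made up, and renaming the file to `segments.gen.bak` instead of deleting it is a cautious variation, not what the steps above literally say, so that the change can be undone. Stopping and starting Elasticsearch is left to the operator and should happen before and after calling it.

```python
from pathlib import Path


def sideline_segments_gen(index_dir):
    """Rename segments.gen in index_dir to segments.gen.bak.

    Returns the backup path, or None if there was no segments.gen.
    """
    src = Path(index_dir) / "segments.gen"
    if not src.is_file():
        return None  # nothing to do in this directory

    dst = src.with_name("segments.gen.bak")
    if dst.exists():
        # Refuse to overwrite an earlier backup.
        raise FileExistsError(f"{dst} already exists; remove or rename it first")

    src.rename(dst)
    return dst
```

Once the shard has recovered and the data looks right, the `.bak` file can be deleted by hand.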

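Root cause of the problem

  1. Shards fail for various reasons, especially when they are unable to answer Kibana requests.

  2. Lucene has no direct connection to this process. So when something goes wrong, Elasticsearch cannot reliably pick up the shard's values from the Lucene segment references stored in segments.gen.

  3. Lucene sets this value afresh on its next run. Elasticsearch can then reference the values correctly, and the shard problem is resolved.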