Elasticsearch nodes disconnecting

run*_*arM 5 elasticsearch

We are having a problem where some nodes in the cluster suddenly leave the cluster for no apparent reason.

We are running Elasticsearch v0.20.6 on JVM 7u25, with unicast discovery.

This is an embedded ES instance, and the cluster has 7 nodes. Nodes 47, 48, 49 and 50 are at one location (network), and 24, 25 and 26 are at another.

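The same thing happens every time, with the index files deleted between tests. One of the nodes 24, 25 or 26 suddenly decides that it is the master, which in turn leads to a split-brain situation. That part is fine, I understand why it happens; the question is why the disconnects happen in the first place.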

First, NODE47 is elected master. All other nodes join, and everything runs smoothly for a couple of hours or so.

Then suddenly, at around 19:10, comes the first trace that something is clearly going wrong:

Node47:
2013-08-14 19:09:49,243 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] disconnected from [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}], channel closed event
2013-08-14 19:09:54,109 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] disconnected from [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], channel closed event
2013-08-14 19:10:06,008 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] disconnected from [[local][da-T28GDRtWgadrkCvxS-w][inet[/**NODE25**:8800]]{local=false}], channel closed event
2013-08-14 19:10:34,253 TRACE [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][generic][T#19]) [local] [node  ] [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}] transport disconnected (with verified connect)
2013-08-14 19:10:34,259 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#24]) [local] connected to node [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}]
2013-08-14 19:10:34,259 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#25]) [local] connected to node [[local][da-T28GDRtWgadrkCvxS-w][inet[/**NODE25**:8800]]{local=false}]
2013-08-14 19:10:34,273 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#26]) [local] connected to node [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}]
2013-08-14 19:10:34,290 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#27]) [local] disconnected from [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}]


Node24:
2013-08-14 19:10:35,167 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] pinging a master [local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false} but we do not exists on it, act as if its master failure
2013-08-14 19:10:35,170 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] stopping fault detection against master [[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false}], reason [master failure, do not exists on master, act as master failure]
2013-08-14 19:10:35,171 INFO  [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][T#1]) [local] master_left [[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false}], reason [do not exists on master, act as master failure]
2013-08-14 19:10:35,174 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][clusterService#updateTask][T#1]) [local] [master] restarting fault detection against master [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}], reason [possible elected master since master left (reason = do not exists on master, act as master failure)]
2013-08-14 19:10:35,181 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#1]) [local] disconnected from [[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/**NODE47**:8800]]{local=false}]
2013-08-14 19:10:36,233 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] pinging a master [local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false} that is no longer a master
2013-08-14 19:10:36,235 INFO  [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][T#5]) [local] master_left [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}], reason [no longer master]
2013-08-14 19:10:36,235 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4}) [local] [master] stopping fault detection against master [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}], reason [master failure, no longer master]
2013-08-14 19:10:36,241 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][clusterService#updateTask][T#1]) [local] [master] restarting fault detection against master [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], reason [possible elected master since master left (reason = no longer master)]
2013-08-14 19:10:36,245 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#5]) [local] disconnected from [[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/**NODE45**:8800]]{local=false}]
2013-08-14 19:10:37,359 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] [master] pinging a master [local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false} that is no longer a master
2013-08-14 19:10:37,361 INFO  [org.elasticsearch.discovery.zen] (elasticsearch[local][generic][T#10]) [local] master_left [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], reason [no longer master]
2013-08-14 19:10:37,363 DEBUG [org.elasticsearch.discovery.zen.fd] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] [master] stopping fault detection against master [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}], reason [master failure, no longer master]
2013-08-14 19:10:37,393 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#10]) [local] disconnected from [[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/**NODE26**:8800]]{local=false}]

As far as I can read the logs, this is what happens:

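19:09:49,243: NODE47 (the master) receives a channel closed event for its connection to NODE24 and disconnects from it. 19:10:34,273: the connection to NODE24 is established again, and then at 19:10:34,290 we disconnect from NODE24. 19:10:35,167: NODE24 pings the master (NODE47), but the master does not have NODE24 in its node list, and NODE24 treats this as a master failure.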
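To check a timeline like this without reading timestamps by eye, the connection events can be pulled out of the log with a small script. This is only a sketch: the regular expressions are my own guess at the line layout, based on the excerpts above, and `connection_events` is a hypothetical helper name, not part of Elasticsearch.

```python
import re
from datetime import datetime

# Assumption: every line starts with "YYYY-MM-DD HH:MM:SS,mmm LEVEL ...",
# as in the excerpts quoted in the question.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\w+\s+.*?"
    r"(?P<event>disconnected from|connected to node|transport disconnected)"
)
# Assumption: node names appear as inet[/**NODEnn**:port], as in the excerpts.
NODE_RE = re.compile(r"inet\[/\*\*(?P<node>NODE\d+)\*\*:\d+\]")


def connection_events(log_text):
    """Return sorted (timestamp, node, event) tuples for connect/disconnect lines."""
    events = []
    for line in log_text.splitlines():
        line = line.strip()
        match = LINE_RE.search(line)
        node = NODE_RE.search(line)
        if not (match and node):
            continue  # not a connection event, or no node address on the line
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        events.append((ts, node.group("node"), match.group("event")))
    return sorted(events)


# Three NODE24-related lines copied from the Node47 log above.
SAMPLE = """
2013-08-14 19:09:49,243 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3}) [local] disconnected from [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}], channel closed event
2013-08-14 19:10:34,273 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#26]) [local] connected to node [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}]
2013-08-14 19:10:34,290 DEBUG [org.elasticsearch.transport.netty] (elasticsearch[local][generic][T#27]) [local] disconnected from [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/**NODE24**:8800]]{local=false}]
"""

if __name__ == "__main__":
    events = connection_events(SAMPLE)
    for ts, node, event in events:
        print(ts.time(), node, event)
    gap = (events[1][0] - events[0][0]).total_seconds()
    print("seconds between channel close and reconnect:", gap)
```

Run against the three NODE24 lines, it puts the channel closed event roughly 45 seconds before the reconnect, and the second disconnect 17 ms after that.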

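All of this happens within a second, so no timeout that I know of is at work. Also, there was no big GC or any measurable slowdown during or before this period.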

I'm at a loss; why does this happen? If it is a network problem, what should be tested on the network side?

run*_*arM 2

I'm answering this myself with the actual cause of the behaviour:

The TCP connection between 2 nodes dropped, while their connections to the other nodes stayed up. This can be reproduced with a utility such as tcpkill.

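Unfortunately, Elasticsearch's Zen discovery does not handle an error like this very well, and all sorts of weird results are possible. The node that loses its connection to the master will hold an election and may confuse the other nodes.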
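To make the sequence in the Node24 log easier to follow, here is a toy model of what the disconnected node appears to do. It is a sketch under one loud assumption: that the node tries master candidates in ascending node-ID order. That matches the order seen in the Node24 log (NODE45, then NODE26), but it is my reading of the log, not something I have confirmed in the Zen discovery source. The function names are hypothetical.

```python
# Node IDs and names copied from the log excerpts in the question.
KNOWN_NODES = {
    "Y01TgbUzRg-JIIpQ7NqlZg": "NODE47",
    "JrRrD5Y8R8WHn1ZAkjYNBw": "NODE45",
    "V7FXnZiLR-GVIyZ2DOwV2w": "NODE26",
    "VbxjXeqGRIyNFzvK-1JCIw": "NODE24",
    "da-T28GDRtWgadrkCvxS-w": "NODE25",
}


def next_candidate(known_ids, rejected):
    """ASSUMPTION: pick the lowest node ID that has not been rejected yet."""
    remaining = sorted(node_id for node_id in known_ids if node_id not in rejected)
    return remaining[0] if remaining else None


def simulate_election(local_id, known_ids, failed_master_id, accepts_as_master):
    """Walk through candidates until one accepts or the local node picks itself.

    accepts_as_master(candidate_id) models the ping reply: True if the
    candidate says it is the master, False for "no longer master".
    """
    rejected = {failed_master_id}  # the old master is dropped first
    tried = []
    while True:
        candidate = next_candidate(known_ids, rejected)
        if candidate is None:
            return tried, None
        tried.append(candidate)
        if candidate == local_id or accepts_as_master(candidate):
            return tried, candidate
        rejected.add(candidate)


if __name__ == "__main__":
    # NODE24 lost its connection to the master NODE47; in this toy model no
    # other node claims to be master, as in the "no longer master" log lines.
    tried, elected = simulate_election(
        local_id="VbxjXeqGRIyNFzvK-1JCIw",
        known_ids=KNOWN_NODES,
        failed_master_id="Y01TgbUzRg-JIIpQ7NqlZg",
        accepts_as_master=lambda candidate_id: False,
    )
    print("tried:", [KNOWN_NODES[i] for i in tried])
    print("elected:", KNOWN_NODES[elected])
```

In this model NODE24 tries NODE45 and NODE26, is turned away by both, and ends up electing itself. The last step is what the model predicts; the log excerpt stops before it, although the question does report one of 24, 25 or 26 deciding it is master.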