MySQL Cluster exceeded MaxBufferedEpochs

Gpo*_*ost 5 ndbcluster mysql-cluster

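I have a MySQL Cluster with 4 API nodes, 2 management nodes, and 4 data nodes. Today I ran into problems connecting to the database: every query hung in the "Opening tables" state. Checking the logs, I found these errors: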

API node errors:

2015-08-20 19:44:14 15540 [Note] NDB Schema dist: Data node: 5 failed, subscriber bitmask 00
2015-08-20 19:44:14 15540 [Note] NDB Schema dist: Data node: 6 failed, subscriber bitmask 00
2015-08-20 19:44:14 15540 [Note] NDB Schema dist: Data node: 7 failed, subscriber bitmask 00
2015-08-20 19:44:14 15540 [Note] NDB Schema dist: Data node: 8 failed, subscriber bitmask 00
2015-08-20 19:44:14 15540 [Note] NDB Schema dist: cluster failure at epoch 3313124/17.
2015-08-20 19:44:14 15540 [Note] NDB Binlog: ndb tables initially read only on reconnect.
2015-08-20 19:44:14 15540 [ERROR] /opt/mysql/server-5.6/bin/mysqld: Got temporary error 4028 'Node failure caused abort of transaction' from NDBCLUSTER
2015-08-20 19:44:14 15540 [ERROR] /opt/mysql/server-5.6/bin/mysqld: Sort aborted: Got temporary error 4028 'Node failure caused abort of transaction' from NDBCLUSTER
2015-08-20 19:44:14 15540 [ERROR] Got error 4010 when reading table './database_name/table'
2015-08-20 19:44:14 15540 [Note] NDB Binlog: cluster failure for ./database_name/table_name at epoch 3313124/17.

mysql> show processlist;

Id  User    Host    db  Command Time    State   Info
1   system user     NULL    Daemon  1497    Waiting for ndbcluster to start NULL

Data node errors:

2015-08-20 19:44:14 [ndbd] ERROR -- c_gcp_list.seize() failed: gci: 14229759227592721 nodes: 0000000000000000000000000000040000000000000000000000000000001a00
2015-08-20 19:44:14 [ndbd] WARNING -- ACK wo/ gcp record (gci: 3313124/17) ref: 0fa2000b from: 0fa2000b
2015-08-20 19:44:14 [ndbd] WARNING -- ACK wo/ gcp record (gci: 3313124/17) ref: 0fa2000c from: 0fa2000c
2015-08-20 19:44:14 [ndbd] WARNING -- ACK wo/ gcp record (gci: 3313124/17) ref: 0fa2008a from: 0fa2008a
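The `gci` in the `c_gcp_list.seize()` line looks different from the `3313124/17` epoch in the other messages, but it appears to be the same value packed into a single 64-bit integer. A minimal sketch, assuming the high 32 bits hold the first number and the low 32 bits the second (the helper name `split_gci` is hypothetical, not part of any NDB tool):

```python
def split_gci(gci64):
    """Split a packed 64-bit epoch into (high 32 bits, low 32 bits)."""
    return gci64 >> 32, gci64 & 0xFFFFFFFF


# Value copied from the data node error line above.
hi, lo = split_gci(14229759227592721)
print("%d/%d" % (hi, lo))  # prints 3313124/17
```

Under that assumption, the failed `seize()` refers to the same epoch that the API nodes report as the point of cluster failure.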

Management node errors:

2015-08-20 19:44:14 [MgmtSrvr] INFO     -- Node 5: Disconnecting lagging nodes '0000000000000000000000000000000000000000000000000000000000000200',
2015-08-20 19:44:14 [MgmtSrvr] WARNING  -- Node 5: Disconnecting node 9 because it has exceeded MaxBufferedEpochs (100 > 100), epoch 3313119/4
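To find which subscriber was dropped and how far behind it was, the two management log lines can be picked apart mechanically. This is only a sketch: the regular expression is written against the exact message text shown above, and the bitmask decoding assumes a hexadecimal mask whose least significant bit is node 0, which is consistent with node 9 being the one disconnected. The function names are hypothetical.

```python
import re

# Pattern written against the log text quoted above; other NDB versions
# may word the message differently.
LAG_PATTERN = re.compile(
    r"Node (?P<reporter>\d+): Disconnecting node (?P<lagging>\d+) because it has "
    r"exceeded MaxBufferedEpochs \((?P<buffered>\d+) > (?P<limit>\d+)\), "
    r"epoch (?P<gci_hi>\d+)/(?P<gci_lo>\d+)"
)


def parse_lag_line(line):
    """Return the numeric fields of a MaxBufferedEpochs warning, or None."""
    match = LAG_PATTERN.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def nodes_in_bitmask(hex_mask):
    """List the bit positions set in a hex bitmask (assumed: bit n = node n)."""
    value = int(hex_mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


line = ("2015-08-20 19:44:14 [MgmtSrvr] WARNING  -- Node 5: Disconnecting node 9 "
        "because it has exceeded MaxBufferedEpochs (100 > 100), epoch 3313119/4")
mask = "0000000000000000000000000000000000000000000000000000000000000200"

print(parse_lag_line(line)["lagging"])  # prints 9
print(nodes_in_bitmask(mask))           # prints [9]
```

Both lines point at node 9, which is an API node in this setup, so the lagging subscriber was a mysqld rather than a data node.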

Detailed logs and configuration

Data node configuration:

https://gist.github.com/sdemircan/730fa49fcc14b4376c42

API node configuration:

https://gist.github.com/sdemircan/f9d230d32700b86564fd

Management node configuration:

https://gist.github.com/sdemircan/d6fbd54799daaae01bf2

API node log:

https://gist.github.com/sdemircan/2d62b1c92176de9de9d3

Data node log:

https://gist.github.com/sdemircan/d0c97b82457a9c33deaa

Data node log:

https://gist.github.com/sdemircan/3faa1e41367bc7655210

Management node log:

https://gist.github.com/sdemircan/a026ac57757fafdafaa9

What could cause MaxBufferedEpochs to reach its limit?

KCD*_*KCD 0

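You may have had a large transaction, or a query retrieving many rows, that pulled back a large amount of data and saturated node 9's network connection.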

On reconnect the API node is read-only, so the culprit must be a query that ran before the disconnect:

NDB Binlog: ndb tables initially read only on reconnect
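To get a feel for how long node 9 must have been stalled, the limit can be converted into wall-clock time. This is a rough sketch, not a measurement: it assumes TimeBetweenEpochs is at its documented default of 100 ms, which has not been checked against the linked data node configuration, and the function name is hypothetical.

```python
def max_lag_seconds(max_buffered_epochs, time_between_epochs_ms):
    """Approximate time a subscriber can fall behind before being disconnected."""
    return max_buffered_epochs * time_between_epochs_ms / 1000.0


# 100 is the limit reported in the management log; 100 ms is an assumed default.
print(max_lag_seconds(100, 100))  # prints 10.0
```

With those numbers the API node would have gone roughly ten seconds without acknowledging epochs, which fits a saturated link or a mysqld busy with one very large result set.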