Aeron Cluster 交付保证是什么

Fra*_*ank 1 aeron

在这篇文章中,有一条已批准的评论,其中包含以下声明:

集群通过使用仲裁协议将其提升到一个新的水平,以防止节点发生故障时消息丢失。

我正在测试一个集群节点发生故障时的传递,但根据我的观察,如果节点发生故障,消息可能会丢失。

我正在使用aeron 代码库io.aeron.samples.cluster.tutorial.BasicAuctionClusterClient(版本1.38.1)io.aeron.samples.cluster.tutorial.BasicAuctionClusterClient

我做了一个小调整,BasicAuctionClusterClient看看是否收到消息:

    public void onSessionMessage(
        final ClientSession session,
        final long timestamp,
        final DirectBuffer buffer,
        final int offset,
        final int length,
        final Header header)
    {
        final long correlationId = buffer.getLong(offset + CORRELATION_ID_OFFSET);                   // <1>
        System.out.println("Received message with correlation ID " + correlationId); // this line is added
        // the rest is the same

    }
Run Code Online (Sandbox Code Playgroud)

当我启动具有 3 个节点的集群时,其中 1 个被选为LEADER。然后我启动它开始向集群BasicAuctionClusterClient发送请求。

当我停止领导者时,新领导者会按预期选举出来,但从此时点到新领导者选举的消息永远不会到达集群(请参见下面的相关 ID 中的间隙)。

New role is LEADER
Received message with correlation ID -8046281870845246166
attemptBid(this=Auction{bestPrice=144, currentWinningCustomerId=1}, price=152,customerId=1)
Received message with correlation ID -8046281870845246165
attemptBid(this=Auction{bestPrice=152, currentWinningCustomerId=1}, price=158,customerId=1)
Consensus Module
io.aeron.cluster.client.ClusterEvent: WARN - leader heartbeat timeout
Received message with correlation ID -8046281870845246154
attemptBid(this=Auction{bestPrice=158, currentWinningCustomerId=1}, price=167,customerId=1)
Run Code Online (Sandbox Code Playgroud)

如果开发商希望保证交付(处理),他们应该做什么?是否期望在集群节点端拥有具有重试和重复请求处理功能的定制ack 系统?

Mic*_*zak 5

Aeron cluster 提供了某些保证,但它们与您所想到的保证略有不同。

我正在测试一个集群节点发生故障时的传递,但根据我的观察,如果节点发生故障,消息可能会丢失。

There is nothing unusual in losing last few messages that you published. There are many reason why it can happen. The process on the receiving side can die etc.

If I read the code of the io.aeron.cluster.client.AeronCluster#offer(org.agrona.DirectBuffer, int, int) correctly, it is a non blocking publication that does not wait for the message to be committed before returning control to the client. I use the word 'committed' as defined by the Raft protocol that Aeron Cluster implements. If you read the Raft paper, it says

Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers

If your messages were committed in a Raft sense before the previous leader died, your newly elected leader of a multi-node Aeron cluster will eventually process them in order.

Re your last question

What is expected from the developer to do in case they want to have the delivery (processing) guaranteed?

  • check if the offer result is not negative (e.g. io.aeron.Publication#NOT_CONNECTED) to detect issues earlier, but more importantly
  • use a higher level protocol with a sequence number/correlation Id that sends back ACKs from within your receiving io.aeron.cluster.service.ClusteredService implementation. It would guarantee that the message was committed in a Raft sense as it is a prerequisite to processing it by the Aeron Cluster state machine (onSessionMessage).