Kafka partitions out of sync on certain nodes

Aka*_*rot 7 apache-kafka apache-zookeeper

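I am running a Kafka cluster on 3 EC2 instances. Each instance runs Kafka (0.11.0.1) and ZooKeeper (3.4). My topics are configured so that each topic has 20 partitions and a ReplicationFactor of 3.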

Today I noticed that some partitions refuse to sync to all three nodes. Here is an example:

bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" --describe --topic prod-decline
Topic:prod-decline    PartitionCount:20    ReplicationFactor:3    Configs:
    Topic: prod-decline    Partition: 0    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 1    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 2    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 3    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 4    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 5    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 6    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 7    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 8    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 9    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 10    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 11    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 12    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 13    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 14    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 15    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 16    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 17    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 18    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 19    Leader: 2    Replicas: 2,0,1    Isr: 2
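With 20 partitions per topic, the out-of-sync ones are easier to spot when the listing is filtered. The following is a minimal sketch, assuming the `--describe` output has the layout shown above; `under_replicated` is a hypothetical helper name, not part of the Kafka distribution. It prints only the partitions whose `Isr` list is shorter than their `Replicas` list.

```shell
#!/usr/bin/env bash
# under_replicated: read `kafka-topics.sh --describe` output on stdin and
# print the partitions whose ISR is smaller than the assigned replica set.
# (Hypothetical helper name; not part of the Kafka distribution.)
under_replicated() {
  awk '
    # Only the per-partition lines carry both "Replicas:" and "Isr:".
    /Partition:/ && /Replicas:/ && /Isr:/ {
      part = ""; replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      # Broker ids are comma-separated, so split() returns how many there are.
      if (split(isr, a, ",") < split(replicas, b, ","))
        printf "partition %s: replicas=%s isr=%s\n", part, replicas, isr
    }
  '
}
```

It could be used by piping the describe command into it, for example `bin/kafka-topics.sh --zookeeper "10.0.0.1:2181" --describe --topic prod-decline | under_replicated`. The `kafka-topics.sh` tool also accepts `--under-replicated-partitions` together with `--describe`, which should give a similar filtered view without any extra scripting.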

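Only node 2 has all the data in sync. I have tried restarting brokers 0 and 1, but that did not improve the situation; it actually made it worse. I am tempted to restart node 2, but I assume that would lead to downtime or a cluster failure, so I would like to avoid it if possible.

I don't see any obvious errors in the logs, so I am having a hard time figuring out how to debug this. Any tips would be greatly appreciated.

Thanks!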

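EDIT: Some additional info... If I check the metrics on node 2 (the one with the full data), it does realize that some partitions are not replicated correctly: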

$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 930;

Nodes 0 and 1 do not. They seem to think everything is fine:

$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 0;

Is this the expected behavior?

muk*_*210 2

Try increasing replica.lag.time.max.ms.
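Before raising the value it helps to know what each broker is currently using. A minimal sketch, assuming the setting lives in the broker's properties file (commonly `config/server.properties`) and that the documented default of 10000 ms for Kafka 0.11 applies when the key is absent; `replica_lag_ms` is a hypothetical helper name:

```shell
# replica_lag_ms FILE: print the replica.lag.time.max.ms value set in a broker
# properties file, or 10000 (the documented default for Kafka 0.11) if unset.
# (Hypothetical helper; it only understands plain key=value lines.)
replica_lag_ms() {
  local value
  # Strip the key and "=" from matching lines; the last occurrence wins.
  value=$(sed -n 's/^[[:space:]]*replica\.lag\.time\.max\.ms[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1)
  echo "${value:-10000}"
}
```

Running `replica_lag_ms config/server.properties` on each broker shows the effective value. To increase it, a line such as `replica.lag.time.max.ms=30000` (an illustrative number, not a recommendation) can be added to that file; in this Kafka version the setting is read at startup, so it should only take effect after the broker is restarted.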

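The explanation goes like this:

If a replica fails to send a fetch request for longer than replica.lag.time.max.ms, it is considered dead and is removed from the ISR.

If a replica starts lagging behind the leader for longer than replica.lag.time.max.ms, it is considered too slow and is removed from the ISR. So even if there is a spike in traffic and large batches of messages are written on the leader, the replica will not shuffle in and out of the ISR unless it consistently remains behind the leader for replica.lag.time.max.ms.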
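The rule can be illustrated with a toy model. This is a simplified sketch for illustration only, not Kafka's actual implementation: it assumes each replica is described by the timestamp (in ms) at which it last caught up with the leader, and `isr_members` is a hypothetical helper name.

```shell
# isr_members NOW_MS MAX_LAG_MS: read "replica_id last_caught_up_ms" pairs on
# stdin and print the ids that would remain in the ISR under the rule above.
# Simplified model for illustration only; not the broker's real code.
isr_members() {
  awk -v now="$1" -v max_lag="$2" '
    # A replica stays in sync while the time since it last caught up with
    # the leader does not exceed max_lag.
    (now - $2) <= max_lag { print $1 }
  '
}
```

For example, `printf '0 1000\n1 9500\n2 20000\n' | isr_members 20000 10000` keeps only replica 2, because replicas 0 and 1 last caught up more than 10000 ms before `now`. With a threshold of 30000 instead, all three replicas stay in the ISR, which is the effect that increasing replica.lag.time.max.ms is meant to have.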