Aka*_*rot 7 apache-kafka apache-zookeeper
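I'm running a Kafka cluster on 3 EC2 instances. Each instance runs Kafka (0.11.0.1) and ZooKeeper (3.4). My topics are configured with 20 partitions each and a ReplicationFactor of 3.
Today I noticed that some partitions refuse to sync to all three nodes. Here is an example: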
bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" --describe --topic prod-decline
Topic:prod-decline PartitionCount:20 ReplicationFactor:3 Configs:
Topic: prod-decline Partition: 0 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 1 Leader: 2 Replicas: 2,0,1 Isr: 2
Topic: prod-decline Partition: 2 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
Topic: prod-decline Partition: 3 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
Topic: prod-decline Partition: 4 Leader: 2 Replicas: 2,1,0 Isr: 2
Topic: prod-decline Partition: 5 Leader: 2 Replicas: 0,2,1 Isr: 2
Topic: prod-decline Partition: 6 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 7 Leader: 2 Replicas: 2,0,1 Isr: 2
Topic: prod-decline Partition: 8 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
Topic: prod-decline Partition: 9 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
Topic: prod-decline Partition: 10 Leader: 2 Replicas: 2,1,0 Isr: 2
Topic: prod-decline Partition: 11 Leader: 2 Replicas: 0,2,1 Isr: 2
Topic: prod-decline Partition: 12 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 13 Leader: 2 Replicas: 2,0,1 Isr: 2
Topic: prod-decline Partition: 14 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
Topic: prod-decline Partition: 15 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
Topic: prod-decline Partition: 16 Leader: 2 Replicas: 2,1,0 Isr: 2
Topic: prod-decline Partition: 17 Leader: 2 Replicas: 0,2,1 Isr: 2
Topic: prod-decline Partition: 18 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 19 Leader: 2 Replicas: 2,0,1 Isr: 2
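Only node 2 has all the data in sync. I tried restarting brokers 0 and 1, but that didn't improve the situation; it made it worse. I'm tempted to restart node 2, but I assume that would lead to downtime or a cluster failure, so I'd like to avoid it if possible.
I don't see any obvious errors in the logs, so I'm having a hard time figuring out how to debug this. Any tips would be greatly appreciated.
Thanks!
EDIT: Some additional info... If I check the metrics on node 2 (the one with the complete data), it does realize that some partitions are not replicated correctly: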
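The ISR column in the describe output above can also be checked mechanically. Below is a minimal sketch, assuming the output format shown in this post; the function name `under_replicated` and the `SAMPLE` constant are hypothetical helpers for illustration, not part of any Kafka tooling.

```python
import re

# A few lines copied from the describe output above.
SAMPLE = """\
Topic:prod-decline PartitionCount:20 ReplicationFactor:3 Configs:
Topic: prod-decline Partition: 0 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 2 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
Topic: prod-decline Partition: 5 Leader: 2 Replicas: 0,2,1 Isr: 2
"""

# Matches one per-partition line; the summary header line does not match.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) for every partition
    whose ISR is smaller than its assigned replica set."""
    result = []
    for line in describe_output.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip the header and any unrelated lines
        replicas = {int(x) for x in m["replicas"].split(",")}
        isr = {int(x) for x in m["isr"].split(",") if x}
        missing = sorted(replicas - isr)
        if missing:
            result.append((m["topic"], int(m["partition"]), missing))
    return result


if __name__ == "__main__":
    for topic, partition, missing in under_replicated(SAMPLE):
        print(f"{topic}-{partition}: brokers {missing} are out of sync")
```

Feeding it the full output would list every partition where brokers 0 and 1 have dropped out of the ISR.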
$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 930;
Nodes 0 and 1 don't. They seem to think everything is fine:
$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 0;
Is this expected behavior?
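Try increasing replica.lag.time.max.ms.
The explanation goes like this:
If a replica fails to send a fetch request for longer than replica.lag.time.max.ms, it is considered dead and is removed from the ISR.
If a replica starts lagging behind the leader for longer than replica.lag.time.max.ms, it is considered too slow and is removed from the ISR. So even if there is a spike in traffic and a large batch of messages is written on the leader, the replica will not shuffle in and out of the ISR unless it consistently stays behind the leader for replica.lag.time.max.ms.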
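To make the rule concrete, here is a simplified model of the time-based ISR check described above. It is only a sketch, not the broker's actual code: the `Follower` class, the `remaining_isr` function, and the 10000 ms threshold are assumptions chosen for illustration. Both cases collapse into one condition, because a dead replica and a slow replica both stop advancing the time at which they were last caught up with the leader.

```python
from dataclasses import dataclass

# Illustrative threshold in milliseconds (an assumption for this sketch);
# the real value comes from replica.lag.time.max.ms in the broker config.
REPLICA_LAG_TIME_MAX_MS = 10_000


@dataclass
class Follower:
    broker_id: int
    # Timestamp (ms) at which this follower was last fully caught up
    # with the leader's log end.
    last_caught_up_ms: int


def remaining_isr(followers, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return the broker ids that stay in the ISR: followers that were
    caught up with the leader within the last max_lag_ms milliseconds."""
    return [
        f.broker_id
        for f in followers
        if now_ms - f.last_caught_up_ms <= max_lag_ms
    ]


if __name__ == "__main__":
    followers = [Follower(0, 95_000), Follower(1, 85_000)]
    # Broker 1 has been behind for 15 s, so it drops out at a 10 s threshold.
    print(remaining_isr(followers, now_ms=100_000))
    # With a larger threshold it stays in the ISR.
    print(remaining_isr(followers, now_ms=100_000, max_lag_ms=30_000))
```

In this model, raising the threshold only gives slow followers more time before they are dropped; it does not make them catch up any faster, so it is worth checking why brokers 0 and 1 are lagging as well.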