KRaft - unable to register with the controller quorum

Kin*_*ord 5 apache-kafka docker nomad

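I am running Apache Kafka v3.1 in Docker and trying to orchestrate it with Nomad. I am having trouble creating a distributed cluster.

The goal is to have 3 broker/controller nodes on 3 EC2 instances: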

:~$ nslookup broker.service.brain.consul
Server:         127.0.0.1
Address:        127.0.0.1#53

Name:   broker.service.brain.consul
Address: 30.10.12.52
Name:   broker.service.brain.consul
Address: 30.10.11.8
Name:   broker.service.brain.consul
Address: 30.10.13.172

From inside one of the Nomad client instances:

IPv4 address for docker0: 172.17.0.1
IPv4 address for ens5:    30.10.13.172
IPv4 address for nomad:   172.26.64.1

Here is the relevant Nomad job configuration:

job "kafka" {
  datacenters = ["stream"]
  type = "service"
  group "broker" {
    count = 3
    service {
      name = "broker"
      port = "9092"
      tags = ["kafka","broker"]
      connect {
        sidecar_service {}
      }
    }
    network {
      mode = "bridge"
      hostname = "${attr.unique.hostname}"
      dns {
        servers = ["172.17.0.1"]
      }
      port "broker" {
        static = 9092
        to     = 9092
      }
      port "controler" {
        static = 9093
        to     = 9093
      }
    }
...
    task "broker" {

      driver = "docker"
      config {

        image = "registry.gitlab.com/.../kafka"
        volumes = ["files/server.properties:/kafka/config/kraft/server.properties"]
        

        ports = [
          "broker",
          "controler"
        ]
...

The server.properties rendered from the template looks like this (node.id differs across the 3 brokers):

process.roles=broker,controller
node.id=2
controller.quorum.voters=1@30.10.11.8:9093,2@30.10.12.52:9093,3@30.10.13.172:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://:9092
inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
num.network.threads=3
num.io.threads=8
request.timeout.ms=60000
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/logs/kraft-combined-logs
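A rendered file like the one above can be sanity-checked before the broker starts. The following is a minimal Python sketch, not part of the original setup: the helper names (`parse_properties`, `parse_voters`, `check_kraft_config`) are hypothetical, and the checks are only the ones I am fairly sure about (the node's own id appears in the voter list, every controller listener name is declared in `listeners`, and no advertised listener has an empty host). An empty advertised host is flagged because, as far as I know, the broker then advertises whatever hostname it resolves for itself, which deserves a second look under bridge networking. It may or may not be the cause here.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def parse_voters(value):
    """Turn 'id@host:port,...' into {id: (host, port)}."""
    voters = {}
    for entry in value.split(","):
        node_id, endpoint = entry.strip().split("@", 1)
        host, port = endpoint.rsplit(":", 1)
        voters[int(node_id)] = (host, int(port))
    return voters


def check_kraft_config(props):
    """Return a list of warnings; an empty list means nothing was flagged."""
    warnings = []
    node_id = int(props["node.id"])
    roles = {role.strip() for role in props["process.roles"].split(",")}
    voters = parse_voters(props["controller.quorum.voters"])
    if "controller" in roles and node_id not in voters:
        warnings.append(f"node.id={node_id} is missing from controller.quorum.voters")

    listener_names = {
        entry.split("://", 1)[0].strip() for entry in props["listeners"].split(",")
    }
    for name in props["controller.listener.names"].split(","):
        if name.strip() not in listener_names:
            warnings.append(f"controller listener {name.strip()} is not in listeners")

    # Assumption: an advertised listener without a host is worth flagging,
    # because the advertised name is then whatever the broker resolves locally.
    for entry in props.get("advertised.listeners", "").split(","):
        if "://" not in entry:
            continue
        name, endpoint = entry.strip().split("://", 1)
        if not endpoint.rsplit(":", 1)[0]:
            warnings.append(f"advertised listener {name} has no explicit host")
    return warnings


# Subset of the rendered server.properties shown above.
SAMPLE = """\
process.roles=broker,controller
node.id=2
controller.quorum.voters=1@30.10.11.8:9093,2@30.10.12.52:9093,3@30.10.13.172:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://:9092
controller.listener.names=CONTROLLER
"""

for warning in check_kraft_config(parse_properties(SAMPLE)):
    print("WARNING:", warning)
```

On the configuration above, the only thing this sketch flags is the PLAINTEXT advertised listener without a host. The voter list and listener names are consistent with each other.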

However, the cluster fails to start, and it looks like a connectivity problem.


[2022-01-24 01:31:15,405] ERROR [BrokerLifecycleManager id=2] Shutting down because we were unable to register with the controller quorum. (kafka.server.BrokerLifecycleManager)
[2022-01-24 01:31:15,407] INFO [BrokerLifecycleManager id=2] registrationTimeout: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue)
[2022-01-24 01:31:15,407] INFO [BrokerLifecycleManager id=2] Transitioning from STARTING to SHUTTING_DOWN. (kafka.server.BrokerLifecycleManager)
[2022-01-24 01:31:15,408] INFO [BrokerServer id=2] Transition from STARTING to STARTED (kafka.server.BrokerServer)
[2022-01-24 01:31:15,408] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat]: Shutting down (kafka.server.BrokerToControllerRequestThread)
[2022-01-24 01:31:15,409] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat]: Stopped (kafka.server.BrokerToControllerRequestThread)
[2022-01-24 01:31:15,410] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat]: Shutdown completed (kafka.server.BrokerToControllerRequestThread)
[2022-01-24 01:31:15,412] ERROR [BrokerServer id=2] Fatal error during broker startup. Prepare to shutdown (kafka.server.BrokerServer)
java.util.concurrent.CancellationException
    at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2396)
    at kafka.server.BrokerLifecycleManager$ShutdownEvent.run(BrokerLifecycleManager.scala:478)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:174)
    at java.base/java.lang.Thread.run(Thread.java:829)
[2022-01-24 01:31:15,417] INFO [BrokerServer id=2] Transition from STARTED to SHUTTING_DOWN (kafka.server.BrokerServer)

...

also 

...

[2022-01-24 02:02:19,304] INFO [RaftManager nodeId=2] Disconnecting from node 1 due to socket connection setup timeout. The timeout value is 10341 ms. (org.apache.kafka.clients.NetworkClient)
[2022-01-24 02:02:19,306] INFO [RaftManager nodeId=2] Disconnecting from node 3 due to socket connection setup timeout. The timeout value is 11036 ms. (org.apache.kafka.clients.NetworkClient)
[2022-01-24 02:02:20,100] INFO [RaftManager nodeId=2] Re-elect as candidate after election backoff has completed (org.apache.kafka.raft.KafkaRaftClient)

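I did try setting listeners to match the new Docker hostname (hostname = "${attr.unique.hostname}") or the EC2 host IP, but neither helped.

I have spent days on this puzzle and am out of ideas. Any help with this would be greatly appreciated.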

小智 1

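I had a similar problem in K8s, where image pulls were very slow. The first instance to start has to wait for the others, but it kept restarting because another instance was still stuck pulling its image.

In my case, setting the Kafka option initial.broker.registration.timeout.ms to 240000 (4 minutes) helped. The first Kafka instance to start then waits up to 4 minutes for the others and no longer exits with the error mentioned above. The default is 60000 (1 minute), which was too low for me.

If you still get the error, I suspect a connectivity misconfiguration where the Kafka instances cannot see each other.

I used the bitnami image, version 3.3.2, with these parameters: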
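To tell a timing problem from a real connectivity problem, it can help to test from inside one container whether each controller endpoint accepts TCP connections. Below is a rough Python sketch under my own assumptions: `parse_voters` and `check_voters` are hypothetical helper names, the voter string is the one from the question, and a successful TCP connect only shows that the port is reachable, not that the quorum is healthy.

```python
import socket


def parse_voters(value):
    """Turn 'id@host:port,...' into {id: (host, port)}."""
    voters = {}
    for entry in value.split(","):
        node_id, endpoint = entry.strip().split("@", 1)
        host, port = endpoint.rsplit(":", 1)
        voters[int(node_id)] = (host, int(port))
    return voters


def check_voters(voters_value, timeout=3.0):
    """Try a plain TCP connect to every voter and report the outcome."""
    results = {}
    for node_id, (host, port) in parse_voters(voters_value).items():
        try:
            # create_connection resolves the host and performs the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[node_id] = "reachable"
        except OSError as exc:  # covers timeouts, refusals and DNS failures
            results[node_id] = f"unreachable ({exc})"
    return results


# Example call, to be run inside a broker container:
# print(check_voters("1@30.10.11.8:9093,2@30.10.12.52:9093,3@30.10.13.172:9093"))
```

If a voter stays unreachable even once all instances are up, raising the registration timeout will probably not help, and the listener or network configuration is the more likely place to look.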

- name: KAFKA_ENABLE_KRAFT
  value: "yes"
- name: KAFKA_KRAFT_CLUSTER_ID # must be generated and unique for each Kafka cluster
  value: "<<REDACTED>>"
- name: KAFKA_CFG_PROCESS_ROLES
  value: "controller,broker"
- name: KAFKA_CFG_CONTROLLER_QUORUM_VOTERS
  value: 0@kafka-inst-0:9093,1@kafka-inst-1:9093,2@kafka-inst-2:9093,3@kafka-inst-3:9093,...
- name: KAFKA_CFG_CONTROLLER_LISTENER_NAMES
  value: CONTROLLER
- name: KAFKA_CFG_LISTENERS
  value: "PLAINTEXT://:9092,CONTROLLER://:9093"
- name: BROKER_ID_COMMAND # needed to derive the broker ID from the hostname
  value: "hostname | awk -F'-' '{print $NF}'"
- name: KAFKA_HEAP_OPTS
  value: "-Xmx1G -Xms1G"
- name: ALLOW_PLAINTEXT_LISTENER
  value: "yes"
- name: KAFKA_CFG_INITIAL_BROKER_REGISTRATION_TIMEOUT_MS
  value: "240000"
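The BROKER_ID_COMMAND above takes the last hyphen-separated field of the hostname, so kafka-inst-2 becomes broker ID 2, and the quorum voters list has to use the same numbering. As an illustration only, here is the same logic in Python. The function names are hypothetical, and the `kafka-inst` prefix and port 9093 are taken from the values above.

```python
def broker_id_from_hostname(hostname):
    """Mimic `awk -F'-' '{print $NF}'`: use the last '-'-separated field as the ID."""
    # int() raises ValueError if the hostname does not end in a number.
    return int(hostname.rsplit("-", 1)[-1])


def quorum_voters(prefix, replicas, port=9093):
    """Build an 'id@host:port' list whose IDs match the hostname ordinals."""
    return ",".join(f"{i}@{prefix}-{i}:{port}" for i in range(replicas))


print(broker_id_from_hostname("kafka-inst-2"))
print(quorum_voters("kafka-inst", 3))
```

Keeping the ID derivation and the voter list in one place like this makes it harder for the two to drift apart when the replica count changes.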