Poor performance with Spark Streaming, Kafka, and multiple topics

Nic*_*las 8 apache-kafka apache-spark spark-streaming

Spark 2.1 + Kafka 0.10 + Spark Streaming.

The batch duration is 30 seconds.

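I have 13 nodes and 2 brokers, and I use 1 core per executor for each topic/partition.
The LocationStrategy is PreferConsistent.
When consuming 1 topic there is no problem: each executor always processes the same topic/partition (tested with up to 24 partitions).
When I add another topic, some of the executors that process a given topic/partition change from one batch to the next.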

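When an executor processes the same topic/partition again (for example 3 batches later, so 1:30 after it last processed it), its KafkaConsumer is disconnected because a request to the broker has timed out (the request.timeout.ms parameter). New fetch requests to Kafka are then blocked for 40 s (again the request.timeout.ms parameter).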

2017-10-09 16:51:30.336 DEBUG    [Executor task launch worker for task 315]:org.apache.spark.internal.Logging$class - Seeking to topic2-7 136136613
2017-10-09 16:51:30.336 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.KafkaConsumer - Seeking to offset 136136613 for partition topic2-7
2017-10-09 16:51:30.337 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.NetworkClient - Disconnecting from node 1005 due to request timeout.
2017-10-09 16:51:30.337 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler - Cancelled FETCH request ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@30ea3352, request=RequestSend(header={api_key=1,api_version=2,correlation_id=25,client_id=consumer-1}, body={replica_id=-1,max_wait_time=500,min_bytes=1,topics=[{topic=topic2,partitions=[{partition=7,fetch_offset=136125064,max_bytes=1048576}]}]}), createdTimeMs=1507557031875, sendTimeMs=1507557031875) with correlation id 25 due to node 1005 being disconnected
2017-10-09 16:51:30.338 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.internals.Fetcher$1 - Fetch failed org.apache.kafka.common.errors.DisconnectException
2017-10-09 16:51:30.341 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater - Initialize connection to node 1006 for sending metadata request
2017-10-09 16:51:30.341 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.NetworkClient - Initiating connection to node 1006 at broker001.domain.loc:9092.
2017-10-09 16:51:30.342 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.common.metrics.Metrics - Added sensor with name node-1006.bytes-sent
2017-10-09 16:51:30.342 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.common.metrics.Metrics - Added sensor with name node-1006.bytes-received
2017-10-09 16:51:30.342 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.common.metrics.Metrics - Added sensor with name node-1006.latency
2017-10-09 16:51:30.343 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.NetworkClient - Completed connection to node 1006
2017-10-09 16:51:30.343 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler - Cancelled FETCH request ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@7d9e82c8, request=RequestSend(header={api_key=1,api_version=2,correlation_id=26,client_id=consumer-1}, body={replica_id=-1,max_wait_time=500,min_bytes=1,topics=[{topic=topic2,partitions=[{partition=7,fetch_offset=136136613,max_bytes=1048576}]}]}), createdTimeMs=1507557090341, sendTimeMs=0) with correlation id 26 due to node 1005 being disconnected
2017-10-09 16:51:30.343 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.internals.Fetcher$1 - Fetch failed org.apache.kafka.common.errors.DisconnectException
2017-10-09 16:51:30.343 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater - Sending metadata request {topics=[topic2]} to node 1006
2017-10-09 16:51:30.344 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler - Cancelled FETCH request ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@4512b012, request=RequestSend(header={api_key=1,api_version=2,correlation_id=27,client_id=consumer-1}, body={replica_id=-1,max_wait_time=500,min_bytes=1,topics=[{topic=topic2,partitions=[{partition=7,fetch_offset=136136613,max_bytes=1048576}]}]}), createdTimeMs=1507557090343, sendTimeMs=0) with correlation id 27 due to node 1005 being disconnected
2017-10-09 16:51:30.344 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.internals.Fetcher$1 - Fetch failed org.apache.kafka.common.errors.DisconnectException
2017-10-09 16:51:30.344 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.Metadata - Updated cluster metadata version 3 to Cluster(nodes = [broker002.domain.loc:9092 (id: 1005 rack: null), broker001.domain.loc:9092 (id: 1006 rack: null)], partitions = [Partition(topic = topic2, partition = 14, leader = 1006, replicas = [1005,1006,], isr = [1006,1005,], Partition(topic = topic2, partition = 13, leader = 1005, replicas = [1005,1006,], isr = [1005,1006,], Partition(topic = topic2, partition = 12, leader = 1006, replicas = [1005,1006,], isr = [1006,1005,], Partition(topic = topic2, partition = 11, leader = 1005, replicas = [1005,1006,], isr = [1005,1006,], Partition(topic = topic2, partition = 10, leader = 1006, replicas = [1005,1006,], isr = [1006,1005,], Partition(topic = topic2, partition = 9, leader = 1005, replicas = [1005,1006,], isr = [1005,1006,], Partition(topic = topic2, partition = 8, leader = 1006, replicas = [1005,1006,], isr = [1006,1005,], Partition(topic = topic2, partition = 7, leader = 1005, replicas = [1005,1006,], isr = [1005,1006,], Partition(topic = topic2, partition = 6, leader = 1006, replicas = [1005,1006,], isr = [1006,1005,], Partition(topic = topic2, partition = 5, leader = 1005, replicas = [1005,1006,], isr = [1005,1006,], Partition(topic = topic2, partition = 4, leader = 1006, replicas = [1005,1006,], isr = [1006,1005,], Partition(topic = topic2, partition = 3, leader = 1005, replicas = [1005,1006,], isr = [1005,1006,], Partition(topic = topic2, partition = 2, leader = 1006, replicas = [1005,1006,], isr = [1006,1005,], Partition(topic = topic2, partition = 1, leader = 1005, replicas = [1005,1006,], isr = [1005,1006,], Partition(topic = topic2, partition = 0, leader = 1006, replicas = [1005,1006,], isr = [1006,1005,]])
2017-10-09 16:51:30.345 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler - Cancelled FETCH request ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@4214186f, request=RequestSend(header={api_key=1,api_version=2,correlation_id=29,client_id=consumer-1}, body={replica_id=-1,max_wait_time=500,min_bytes=1,topics=[{topic=topic2,partitions=[{partition=7,fetch_offset=136136613,max_bytes=1048576}]}]}), createdTimeMs=1507557090344, sendTimeMs=0) with correlation id 29 due to node 1005 being disconnected
2017-10-09 16:51:30.345 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.internals.Fetcher$1 - Fetch failed org.apache.kafka.common.errors.DisconnectException
2017-10-09 16:51:42.942 DEBUG    [LeaseRenewer:hdfs_user@master001.domain.loc:8020]:org.apache.hadoop.hdfs.LeaseRenewer - Lease renewer daemon for [] with renew id 1 executed
2017-10-09 16:52:00.293 DEBUG    [IPC Client (1926664485) connection to master001.domain.loc/10.0.10.1:8020 from hdfs_user]:org.apache.hadoop.ipc.Client$Connection - IPC Client (1926664485) connection to master001.domain.loc/10.0.10.1:8020 from hdfs_user: closed
2017-10-09 16:52:00.293 DEBUG    [IPC Client (1926664485) connection to master001.domain.loc/10.0.10.1:8020 from hdfs_user]:org.apache.hadoop.ipc.Client$Connection - IPC Client (1926664485) connection to master001.domain.loc/10.0.10.1:8020 from hdfs_user: stopped, remaining connections 0
2017-10-09 16:52:10.388 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler - Cancelled FETCH request ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@4b954a27, request=RequestSend(header={api_key=1,api_version=2,correlation_id=30,client_id=consumer-1}, body={replica_id=-1,max_wait_time=500,min_bytes=1,topics=[{topic=topic2,partitions=[{partition=7,fetch_offset=136136613,max_bytes=1048576}]}]}), createdTimeMs=1507557090345, sendTimeMs=0) with correlation id 30 due to node 1005 being disconnected
2017-10-09 16:52:10.389 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.consumer.internals.Fetcher$1 - Fetch failed org.apache.kafka.common.errors.DisconnectException
2017-10-09 16:52:10.389 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.NetworkClient - Initiating connection to node 1005 at broker002.domain.loc:9092.
2017-10-09 16:52:10.390 DEBUG    [Executor task launch worker for task 315]:org.apache.kafka.clients.NetworkClient - Completed connection to node 1005
2017-10-09 16:52:10.397 DEBUG    [Executor task launch worker for task 315]:org.apache.spark.internal.Logging$class - Polled [topic2-7]  2603
2017-10-09 16:52:10.398 DEBUG    [Executor task launch worker for task 315]:org.apache.spark.internal.Logging$class - Getting local block broadcast_13
2017-10-09 16:52:10.398 DEBUG    [Executor task launch worker for task 315]:org.apache.spark.internal.Logging$class - Level for block broadcast_13 is StorageLevel(disk, memory, deserialized, 1 replicas)
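Two intervals can be read straight off the log above: how old the in-flight FETCH was when the executor got topic2-7 again, and how long the consumer then stalled. A small Python sketch of that arithmetic follows. The 40 s value for request.timeout.ms is an assumption inferred from the stall described in the question, not read from the actual consumer configuration.

```python
from datetime import datetime, timedelta

# Assumed value, inferred from the ~40 s stall described in the question.
REQUEST_TIMEOUT_MS = 40_000

# Epoch milliseconds copied from the log: the FETCH with correlation id 25 was
# sent during an earlier batch; the FETCH with correlation id 26 was created
# when this executor was handed topic2-7 again.
stale_fetch_sent_ms = 1507557031875
next_fetch_created_ms = 1507557090341
idle_ms = next_fetch_created_ms - stale_fetch_sent_ms

# Log timestamps of the last failed fetch before the stall and the first
# failed fetch after it.
fmt = "%Y-%m-%d %H:%M:%S.%f"
stall_start = datetime.strptime("2017-10-09 16:51:30.345", fmt)
stall_end = datetime.strptime("2017-10-09 16:52:10.388", fmt)
stall_ms = (stall_end - stall_start) // timedelta(milliseconds=1)

print(f"in-flight fetch age: {idle_ms} ms, past timeout: {idle_ms > REQUEST_TIMEOUT_MS}")
print(f"consumer stalled for: {stall_ms} ms")
```

Under that assumption, the old request had been pending for longer than the timeout by the time the cached consumer was used again, and the stall that follows is almost exactly one more timeout period.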

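What can I do to overcome this kind of problem? Increasing the request.timeout.ms parameter does not seem like a good solution to me.

I have seen that there is a parameter to disable the Kafka consumer cache, which could solve this problem, but it is only available in Spark 2.2, and I cannot move to Spark 2.2.

The only solution I can see for now is to go back to single-topic processing...

Thanks for your help!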

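2017/10/18: update on this issue
The switching of the executors that process a topic/partition is caused by a data locality issue. For some topic/partitions, the executor needed to process the data locally (locality level PROCESS_LOCAL) is not available, so another executor is scheduled to process it (locality level RACK_LOCAL), and that executor can differ from one batch to the next.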
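One possible explanation for why the preferred executor is busy can be modelled in a few lines of Python. The sketch rests on two assumptions that should be checked against the Spark and Kafka sources for the versions in use: that PreferConsistent picks the executor by a floor-mod of the TopicPartition hash over the sorted executor list, and that the hash is computed as in `topic_partition_hash` below. The 24-executor, 1-core layout and the 12 + 12 partition split are illustrative, not the asker's exact setup.

```python
from collections import Counter

def to_int32(value):
    """Wrap an integer to a signed 32-bit value, like Java int arithmetic."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

def java_string_hash(text):
    """Java's String.hashCode() for text in the Basic Multilingual Plane."""
    h = 0
    for ch in text:
        h = to_int32(31 * h + ord(ch))
    return h

def topic_partition_hash(topic, partition):
    # Assumption: mirrors Kafka's TopicPartition.hashCode().
    result = to_int32(31 * 1 + partition)
    return to_int32(31 * result + java_string_hash(topic))

def preferred_executor(topic, partition, num_executors):
    # Assumption: floor-mod of the hash over the sorted executor list.
    # Python's % already behaves like a floor-mod for a positive divisor.
    return topic_partition_hash(topic, partition) % num_executors

def preferred_load(layout, num_executors):
    """Count how many topic/partitions prefer each executor index."""
    return Counter(
        preferred_executor(topic, p, num_executors)
        for topic, partitions in layout.items()
        for p in range(partitions)
    )

one_topic = preferred_load({"topic1": 24}, 24)
two_topics = preferred_load({"topic1": 12, "topic2": 12}, 24)

print("one topic, max partitions per executor:", max(one_topic.values()))
print("two topics, executors preferred twice:",
      sorted(i for i, n in two_topics.items() if n > 1))
```

In this model a single topic spreads evenly, while a second topic makes some executors the preferred location for two partitions at once. With 1 core per executor, the second task cannot run there in the same batch and is scheduled elsewhere, which would match the behaviour described above.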

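My configuration was 1 core per executor.
I changed it to allow 2 cores per executor and it is OK: all tasks are processed locally.
To process 3 topics, I have to change my configuration to 3 cores per executor (the topics are not evenly sized: for example, with 3 topics, topic1 has 15 partitions, topic2 has 3, and topic3 has 6).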

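1 topic, 24 topic/partitions, 24 executors with 1 core each: OK
2 topics, 24 topic/partitions, 12 executors with 2 cores each: OK
3 topics, 24 topic/partitions, 8 executors with 3 cores each: OK
4 topics, 24 topic/partitions, 6 executors with 4 cores each: OK
6 topics, 24 topic/partitions, 4 executors with 6 cores each: KO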
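The pattern in the list above is that the number of cores per executor tracks the number of topics while the total number of task slots stays at 24. A short Python sketch of that arithmetic, using only the figures reported above:

```python
TOTAL_TOPIC_PARTITIONS = 24

# (topics, executors, cores per executor, outcome) as reported in the list above.
layouts = [
    (1, 24, 1, "OK"),
    (2, 12, 2, "OK"),
    (3, 8, 3, "OK"),
    (4, 6, 4, "OK"),
    (6, 4, 6, "KO"),
]

for topics, executors, cores, outcome in layouts:
    slots = executors * cores  # tasks that can run at the same time
    print(f"{topics} topic(s): {executors} x {cores} = {slots} task slots -> {outcome}")
```

Every layout offers exactly one slot per topic/partition, so the failing case is not short of capacity overall; the difference lies in how the tasks are placed.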

With 6 topics, I run into the data locality issue again. What can I do to scale the number of topics in my Spark Streaming job?

Ofe*_*Hod 0

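Repartition the RDD: this triggers a shuffle and ensures that each executor has nearly the same amount of local (in-memory) data to process.
For your 6-topic example, try 12 executors with 2 cores each and .repartition(48). Call repartition before applying any transformation/action to the RDD coming from the Kafka consumer.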

Note that repartitioning may have a performance impact.
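The balancing effect of a repartition can be illustrated without Spark. The Python sketch below is only a model: it assumes records are dealt out round-robin, which is not necessarily how Spark's shuffle assigns them, and the per-partition record counts are hypothetical apart from 2603, which is the count polled for topic2-7 in the question's log.

```python
from itertools import cycle

def repartition_round_robin(partitions, num_partitions):
    """Spread all records over num_partitions buckets in round-robin order."""
    buckets = [[] for _ in range(num_partitions)]
    targets = cycle(range(num_partitions))
    for part in partitions:
        for record in part:
            buckets[next(targets)].append(record)
    return buckets

# Hypothetical, uneven record counts per Kafka partition for one batch.
sizes = [2603, 10, 500, 0, 1200, 37]
kafka_partitions = [[(i, n) for n in range(size)] for i, size in enumerate(sizes)]

balanced = repartition_round_robin(kafka_partitions, 48)
counts = [len(bucket) for bucket in balanced]
print("before:", sizes)
print("after: min", min(counts), "max", max(counts))
```

The model shows the trade-off mentioned above: the work per task becomes nearly uniform, but every record has to be moved once to get there.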