Upload files to Kafka and process them further?

Question


0 apache-kafka

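Is it a good approach to send the binary data of uploaded files to Kafka, and then have several services subscribed to the Kafka topic share the work of handling the uploads?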

I see some advantages:

Filtering uploaded data
Replicas
Several services can handle the uploads, not just one

What do you think about this?

Answer 1

Jav*_*cal 5

Is it a good approach to send the binary data of uploaded files to Kafka, and then have several services subscribed to the Kafka topic share the work of handling the uploads?

Typically, files are uploaded to a file system and their URIs are stored in the Kafka messages. This keeps the Kafka message size small, which increases the throughput of the clients.
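
A minimal sketch of that idea, assuming a local directory as the storage location. The class and method names (UploadReference, store, toMessageValue) are hypothetical, and the actual Kafka producer call is left out; only the small value that would be sent is built here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper: stores an upload on disk and returns the small
// reference that would be sent to Kafka instead of the file itself.
class UploadReference {

    // Writes the bytes under storageDir with a unique name and returns the file's URI.
    static URI store(Path storageDir, String originalName, byte[] content) {
        try {
            Files.createDirectories(storageDir);
            Path target = storageDir.resolve(UUID.randomUUID() + "-" + originalName);
            Files.write(target, content);
            return target.toUri();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds the message value: only the URI as a short UTF-8 string,
    // independent of how large the upload was.
    static byte[] toMessageValue(URI fileUri) {
        return fileUri.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("uploads");
        byte[] upload = new byte[5 * 1024 * 1024]; // stand-in for a 5 MB upload
        URI uri = store(dir, "report.pdf", upload);
        byte[] value = toMessageValue(uri);
        // In a real producer, `value` would be the record value handed to the Kafka client.
        System.out.println(uri + " -> message of " + value.length + " bytes");
    }
}
```

The message stays the same size whether the upload is 5 KB or 5 GB, which is the point of sending a reference rather than the content.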

If we put large objects in the Kafka message, the consumer has to read the entire file as part of fetching the record, so each poll() will take longer than usual.

On the other hand, if we put only the URI of the file in the message, consumption is relatively fast and you can delegate the processing of the files to another thread (possibly from a thread pool), thereby increasing your application's throughput.
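
The delegation to a thread pool could look roughly like the sketch below. UriDispatcher and processAll are hypothetical names, the list of URIs stands in for the record values returned by one poll(), and the Kafka consumer itself is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical dispatcher: the polling thread hands each URI to a pool so
// the files referenced by one batch are processed in parallel.
class UriDispatcher {

    // Submits one task per URI and waits for all results.
    // `processor` stands in for whatever reads and handles the file.
    static <R> List<R> processAll(List<String> fileUris, Function<String, R> processor, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String uri : fileUris) {
                Callable<R> task = () -> processor.apply(uri);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // collected in submission order
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // These URIs stand in for the values of the records from one poll().
        List<String> batch = List.of("file:///uploads/a.png", "file:///uploads/b.png");
        System.out.println(processAll(batch, uri -> uri.length(), 4));
    }
}
```

This version waits for the whole batch before returning, which keeps it simple to commit offsets only after the files have been handled. A real consumer might instead keep a long-lived pool and track completion per partition.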

Replicas

Just as there are replicas in Kafka, there can also be replicas for a file system. Kafka itself stores messages in the file system (as segment files), so the replication may as well be done at the file system level.

The best way is to put a URI that points to the file in the Kafka message, and then provide a handler for that URI which is responsible for giving you the file and possibly for giving you a replica in case the original file has been deleted.

The handler can be loosely coupled to the rest of your system and built specifically for managing the files, maintaining replicas, and so on.
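
One possible shape for such a handler is sketched below, assuming the replicas are plain directories that mirror the primary one. ReplicatedFileHandler and its read method are hypothetical names, not part of any Kafka API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical file handler: resolves a file name against the primary
// location first, then against the replica locations.
class ReplicatedFileHandler {
    private final List<Path> roots; // primary first, replicas after

    ReplicatedFileHandler(List<Path> roots) {
        this.roots = roots;
    }

    // Returns the content from the first root that still has the file.
    byte[] read(String relativeName) {
        for (Path root : roots) {
            Path candidate = root.resolve(relativeName);
            if (Files.isRegularFile(candidate)) {
                try {
                    return Files.readAllBytes(candidate);
                } catch (IOException e) {
                    // Unreadable copy: fall through and try the next root.
                }
            }
        }
        throw new UncheckedIOException(new NoSuchFileException(relativeName));
    }

    public static void main(String[] args) throws IOException {
        Path primary = Files.createTempDirectory("primary");
        Path replica = Files.createTempDirectory("replica");
        // Only the replica has the file, as if the original had been deleted.
        Files.write(replica.resolve("report.txt"), "hello".getBytes());
        ReplicatedFileHandler handler = new ReplicatedFileHandler(List.of(primary, replica));
        System.out.println(new String(handler.read("report.txt")));
    }
}
```

Consumers only depend on read, so how and where the replicas are maintained can change without touching them.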

Filtering uploaded data

The uploaded data can be filtered only when you actually read the contents of the file. You can still do that when the message carries only the URI of the file, by reading the file from there. For example, if you are using Kafka Streams, you can put the filtering logic in filter(), transform(), mapValues(), etc.

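// builder is a StreamsBuilder; the value type is assumed to expose getFileURI()
builder.stream(topic)
    .mapValues(v -> v.getFileURI())
    // read the file behind the URI and keep only messages whose file validates
    .filter((k, fileURI) -> validate(read(fileURI)))
    .to(..)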
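
The read and validate helpers used in the snippet above are not defined in the answer. A minimal sketch of what they might look like, assuming file:// URIs and, purely as an example rule, accepting only files that start with the PNG signature:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helpers matching the names used in the snippet above.
class UploadFilter {

    // Loads the file a message points to; assumes a file:// URI.
    static byte[] read(String fileUri) {
        try {
            return Files.readAllBytes(Paths.get(URI.create(fileUri)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Example rule (an assumption): accept only content that starts with
    // the first four PNG signature bytes 0x89 'P' 'N' 'G'.
    static boolean validate(byte[] content) {
        return content.length >= 4
                && (content[0] & 0xFF) == 0x89
                && content[1] == 'P'
                && content[2] == 'N'
                && content[3] == 'G';
    }
}
```

The file is read only inside the filter step, so the messages flowing through the topic remain small.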


Hitting segment.bytes

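Another disadvantage of storing files in your messages is that you might hit the segment.bytes limit if the files are large, so you would need to keep raising segment.bytes every time the file size requirements grow. In practice the broker's message.max.bytes (max.message.bytes at topic level) and the producer's max.request.size, which both default to about 1 MB, are hit well before that and would have to be raised too.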

Another point: if segment.bytes is set to 1 GB, your first message (file) is 750 MB and your next message is 300 MB, the second message can't fit in the first segment, so the first segment will hold only one message even though it hasn't reached the limit. This means that relatively few messages are stored per segment.
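
The arithmetic can be checked with a small model. This is a simplified sketch of size-based rolling only, using a 750 MB message followed by a 300 MB message against a 1 GiB segment; SegmentRollModel is a hypothetical name, and real brokers also roll segments on time and index size, which is ignored here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of size-based segment rolling: a new segment is started
// when the next message would push the active segment past the limit.
class SegmentRollModel {

    // Returns how many messages end up in each segment.
    static List<Integer> messagesPerSegment(long segmentBytes, long[] messageSizes) {
        List<Integer> counts = new ArrayList<>();
        long used = 0;
        int count = 0;
        for (long size : messageSizes) {
            if (count > 0 && used + size > segmentBytes) {
                counts.add(count); // close the active segment
                used = 0;
                count = 0;
            }
            used += size;
            count++;
        }
        if (count > 0) {
            counts.add(count);
        }
        return counts;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 1 GiB segment, a 750 MB file followed by a 300 MB file.
        System.out.println(messagesPerSegment(1024 * mb, new long[] {750 * mb, 300 * mb}));
        // Same segment size, but 1000 messages that carry only ~200-byte URIs.
        long[] small = new long[1000];
        Arrays.fill(small, 200);
        System.out.println(messagesPerSegment(1024 * mb, small));
    }
}
```

With whole files as messages, each segment ends up holding a single message. With URI-sized messages, all 1000 fit in one segment with room to spare.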

Archived: 5 years, 6 months ago
Viewed: 2896 times
Last activity: 5 years, 6 months ago