Spark Structured Streaming Kafka offset management

Question


Ant*_*ets 5 apache-kafka apache-spark spark-structured-streaming spark-kafka-integration

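I am looking into storing Kafka offsets inside Kafka itself for Spark Structured Streaming, the way it works for DStreams with stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges). I am looking for the same thing, but for Structured Streaming. Does Structured Streaming support it? If so, how can I achieve it?

I know about HDFS checkpointing with .option("checkpointLocation", checkpointLocation), but I am specifically interested in built-in offset management.

I expect Kafka alone to store the offsets, without the Spark HDFS checkpoint.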

Answer 1

小智 0

I am using this code, which I found somewhere.

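import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OffsetManager {

    private final String storagePrefix;

    public OffsetManager(String storagePrefix) {
        this.storagePrefix = storagePrefix;
    }

    /**
     * Overwrite the offset for the topic in an external storage.
     *
     * @param topic     - Topic name.
     * @param partition - Partition of the topic.
     * @param offset    - offset to be stored.
     */
    void saveOffsetInExternalStore(String topic, int partition, long offset) {
        Path path = storagePath(topic, partition);
        try {
            // Make sure the "Offsets" directory exists before writing.
            Files.createDirectories(path.getParent());
            // try-with-resources closes the writer even if write() throws;
            // the default open options truncate any existing file.
            try (BufferedWriter writer = Files.newBufferedWriter(path)) {
                writer.write(Long.toString(offset));
            }
        } catch (IOException e) {
            e.printStackTrace();
            throw new UncheckedIOException(e);
        }
    }

    /**
     * @return the last stored offset + 1 for the provided topic and partition,
     *         or 0 if no offset could be read.
     */
    long readOffsetFromExternalStore(String topic, int partition) {
        try {
            // The file holds a single line containing the last processed offset.
            String firstLine = Files.readAllLines(storagePath(topic, partition)).get(0);
            return Long.parseLong(firstLine.trim()) + 1;
        } catch (IOException | RuntimeException e) {
            // Missing, empty, or malformed file: fall back to offset 0.
            e.printStackTrace();
        }
        return 0;
    }

    // Paths.get joins the segments with the platform's separator,
    // instead of the Windows-only "Offsets\\" prefix.
    private Path storagePath(String topic, int partition) {
        return Paths.get("Offsets", storagePrefix + "-" + topic + "-" + partition);
    }

}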


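SaveOffset... is called after a record has been processed successfully; otherwise no offset is stored. I use a Kafka topic as the source, so I specify the starting offsets as the ones retrieved from ReadOffsets...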
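As a rough sketch of the "specify the starting offsets" step: the Kafka source in Structured Streaming accepts a `startingOffsets` option holding a JSON string of the form `{"topic":{"partition":offset}}`. The helper below only builds that string from offsets that have already been read back from the external store. The class name `StartingOffsets`, the method `toJson`, and the topic name `events` are made up for illustration, and the code assumes the topic name contains no characters that need JSON escaping.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Spark or of the answer's code.
class StartingOffsets {

    /**
     * Builds a JSON string such as {"events":{"0":23,"1":45}} from a map of
     * partition number to the next offset to read.
     * Assumption: the topic name needs no JSON escaping.
     */
    static String toJson(String topic, Map<Integer, Long> nextOffsets) {
        // Copy into a TreeMap so partitions are emitted in ascending order.
        String partitions = new TreeMap<Integer, Long>(nextOffsets).entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":" + e.getValue())
                .collect(Collectors.joining(","));
        return "{\"" + topic + "\":{" + partitions + "}}";
    }

    public static void main(String[] args) {
        Map<Integer, Long> nextOffsets = new TreeMap<>();
        nextOffsets.put(0, 23L); // e.g. a value returned by the offset store for partition 0
        nextOffsets.put(1, 45L);
        System.out.println(toJson("events", nextOffsets));
    }
}
```

The resulting string would then be passed as `.option("startingOffsets", json)` when building the Kafka source. Note that Spark only honors `startingOffsets` when a query starts fresh; if a checkpoint location with prior progress is configured, the query resumes from the checkpoint instead.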

Archived:	6 years, 7 months ago
Views:	1598
Last activity:	4 years, 11 months ago