小编Bob*_*bby的帖子

Spark Streaming:无法计算拆分,未找到块

我试图使用Spark Streaming与Kafka(版本1.1.0)但由于此错误,Spark作业不断崩溃:

14/11/21 12:39:23 ERROR TaskSetManager: Task 3967.0:0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3967.0:0 failed 4 times, most recent failure: Exception failure in TID 43518 on host ********: java.lang.Exception: Could not compute split, block input-0-1416573258200 not found
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3967.0:0 failed 4 times, most recent failure: Exception failure in TID 43518 on host ********: java.lang.Exception: Could not compute …
Run Code Online (Sandbox Code Playgroud)

apache-spark spark-streaming

13
推荐指数
1
解决办法
5951
查看次数

将批处理RDD的结果与Apache Spark中的流式RDD相结合

上下文: 我使用Apache Spark从日志中聚合不同事件类型的运行计数.日志存储在Cassandra中用于历史分析目的,Kafka用于实时分析目的.每个日志都有一个日期和事件类型.为简单起见,我们假设我想跟踪每天单个类型的日志数量.

我们有两个RDD,来自Cassandra的批量数据的RDD和来自Kafka的另一个流式RDD.伪代码:

CassandraJavaRDD<CassandraRow> cassandraRowsRDD = CassandraJavaUtil.javaFunctions(sc).cassandraTable(KEYSPACE, TABLE).select("date", "type");

JavaPairRDD<String, Integer> batchRDD = cassandraRowsRDD.mapToPair(new PairFunction<CassandraRow, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(CassandraRow row) {
        return new Tuple2<String, Integer>(row.getString("date"), 1);
    }
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer count1, Integer count2) {
        return count1 + count2;
    }
});

save(batchRDD) // Assume this saves the batch RDD somewhere

...

// Assume we read a chunk of logs from the Kafka stream every x seconds.
JavaPairReceiverInputDStream<String, String> kafkaStream = …
Run Code Online (Sandbox Code Playgroud)

cassandra apache-kafka apache-spark spark-streaming

9
推荐指数
1
解决办法
6889
查看次数