How to convert Spark Streaming output into a DataFrame or store it in a table

Question

hun*_*uny 1 scala apache-spark spark-streaming apache-spark-sql

My code is:

val lines = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("hello" -> 5))
val data=lines.map(_._2)
data.print()

My output has 50 different values, in the following format:

{"id:st04","data:26-02-2018 20:30:40","temp:30", "press:20"}

Can anyone help me store this data in tabular form, like this?

| id |date               |temp|press|   
|st01|26-02-2018 20:30:40| 30 |20   |  
|st01|26-02-2018 20:30:45| 80 |70   |

I would be very grateful.

Answer 1

T. *_*ęda 5

You can use the foreachRDD function together with the normal Dataset API:

data.foreachRDD { rdd =>
  // rdd is an RDD[String]. The body of foreachRDD runs on the driver, so the
  // SparkSession (spark) can be used here; on Spark 1.x use SQLContext instead.
  val df = spark.read.json(rdd) // or sqlContext.read.json(rdd)
  df.show()
  df.write.saveAsTable("here some unique table ID") // placeholder: use a unique table name
}
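
One caveat worth checking: read.json expects every string to be a valid JSON document, and the sample line in the question puts the key and value inside a single quoted string ("id:st04") instead of forming a key/value pair. If the messages really look like that, a small parsing step may be needed first. Below is a minimal sketch in plain Scala, assuming the format is exactly as posted; SensorRecord and its parse helper are hypothetical names, and the fallback from "date" to "data" only mirrors the key used in the sample.

```scala
// Hypothetical record type; the field names follow the table in the question.
final case class SensorRecord(id: String, date: String, temp: Int, press: Int)

object SensorRecord {
  // Matches one quoted "key:value" entry. The key stops at the first colon,
  // so the colons inside the time part of the value are left intact.
  private val Entry = "\"([^\":]+):([^\"]*)\"".r

  // Returns None for lines that are missing a field or carry a non-numeric reading.
  def parse(line: String): Option[SensorRecord] = {
    val fields = Entry
      .findAllMatchIn(line)
      .map(m => m.group(1).trim -> m.group(2).trim)
      .toMap

    for {
      id    <- fields.get("id")
      date  <- fields.get("date").orElse(fields.get("data")) // the sample uses "data"
      temp  <- fields.get("temp").flatMap(s => scala.util.Try(s.toInt).toOption)
      press <- fields.get("press").flatMap(s => scala.util.Try(s.toInt).toOption)
    } yield SensorRecord(id, date, temp, press)
  }
}
```

Inside foreachRDD this could then be used as rdd.flatMap(line => SensorRecord.parse(line)).toDF() after import spark.implicits._, which gives a DataFrame with the columns id, date, temp and press. If the messages are in fact valid JSON objects, read.json as shown above should be enough and this step can be skipped.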

However, if you are using Spark 2.x, I would suggest using Structured Streaming instead:

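val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribe", "hello")                        // topic name from the question
  .load()
val data = stream
  .selectExpr("cast(value as string) as value")
  .select(from_json(col("value"), schema))
data.writeStream.format("console").start()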

You have to specify the schema manually, but that is quite simple :) Also import org.apache.spark.sql.functions._ before doing any processing.
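
To make the "specify the schema manually" part concrete, here is a rough, self-contained sketch. It rests on several assumptions that are not in the question: the Kafka values are valid JSON objects with the keys id, date, temp and press; the broker listens on localhost:9092; the topic is the "hello" topic from the question's code; and the spark-sql-kafka-0-10 package is on the classpath. The object name KafkaJsonToConsole is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

object KafkaJsonToConsole {
  // Manually specified schema; every field is read as a string to keep the sketch simple.
  val schema: StructType = new StructType()
    .add("id", StringType)
    .add("date", StringType)
    .add("temp", StringType)
    .add("press", StringType)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-json-to-console")
      .master("local[*]") // assumption: local test run
      .getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "hello")                        // topic from the question
      .load()

    val data = stream
      .selectExpr("CAST(value AS STRING) AS value")         // Kafka values arrive as bytes
      .select(from_json(col("value"), schema).as("record")) // parse JSON into a struct
      .select("record.*")                                   // flatten to id, date, temp, press

    data.writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
```

Selecting "record.*" is what turns the parsed struct into the four separate columns from the table in the question. Rows whose value cannot be parsed with this schema come out as nulls instead of failing the query, so it may be worth filtering on a non-null id before writing anywhere permanent.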
