Kar*_*ikJ 2 apache-spark parquet spark-structured-streaming
I am using Spark Structured Streaming; my DataFrame has the following schema:
root
|-- data: struct (nullable = true)
| |-- zoneId: string (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- timeSinceLast: long (nullable = true)
|-- date: date (nullable = true)
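How can I use writeStream to write the data (zoneId, deviceId, timeSinceLast; everything except date) in Parquet format, partitioned by date? I tried this with a partitioning clause, but it did not take effect.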
If you want to partition by date, you have to pass that column to the partitionBy() method.
val query1 = df1
.writeStream
.format("parquet")
.option("path", "/Users/abc/hb_parquet/data")
.option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
.partitionBy("date")
.start()
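Since the question asks for only zoneId, deviceId, and timeSinceLast in the files, one option is to flatten the data struct before writing. This is a minimal sketch, assuming the schema from the question; the names PartitionedParquetSink, flatten, and start are hypothetical helpers, not part of Spark:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper object, for illustration only.
object PartitionedParquetSink {

  // Promote the fields of the `data` struct to top-level columns and keep
  // `date` so that it can still be used as the partition column.
  def flatten(df: DataFrame): DataFrame =
    df.select(col("data.*"), col("date"))

  // Start a Parquet streaming sink partitioned by `date`.
  // `path` and `checkpoint` are supplied by the caller.
  def start(df: DataFrame, path: String, checkpoint: String): StreamingQuery =
    flatten(df)
      .writeStream
      .format("parquet")
      .option("path", path)
      .option("checkpointLocation", checkpoint)
      .partitionBy("date")
      .start()
}
```

With partitionBy("date"), the date value is encoded in directory names of the form date=<value> under the output path rather than stored as a column inside each Parquet file, so the files themselves hold only zoneId, deviceId, and timeSinceLast.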
If you want to partition the data as <year>/<month>/<day>, make sure the date column is of type DateType, and then create columns in the appropriate format:
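import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DataTypes

// Make sure `date` is a DateType column before deriving the partition columns.
val df = dataset.withColumn("date", dataset.col("date").cast(DataTypes.DateType))
// Lowercase "yyyy" is the calendar year; uppercase "YYYY" is the week-based year.
df.withColumn("year", functions.date_format(df.col("date"), "yyyy"))
  .withColumn("month", functions.date_format(df.col("date"), "MM"))
  .withColumn("day", functions.date_format(df.col("date"), "dd"))
  .writeStream
  .format("parquet")
  .option("path", "/Users/abc/hb_parquet/data")
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
  .partitionBy("year", "month", "day")
  .start()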
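The case of the year letter in Java-style date patterns matters: lowercase y is the calendar year, while uppercase Y is the week-based year, and the two disagree for some dates around New Year. The sketch below shows the difference with plain java.time only. It does not call Spark's own formatter, whose handling of Y varies between Spark versions, and the names YearPatternDemo and format are hypothetical:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// Hypothetical demo object, for illustration only.
object YearPatternDemo {

  // Format a date with the given pattern, pinning the locale so that
  // week-based fields do not depend on the machine's default locale.
  def format(date: LocalDate, pattern: String): String =
    DateTimeFormatter.ofPattern(pattern, Locale.US).format(date)

  def main(args: Array[String]): Unit = {
    // 30 December 2019 falls in the first week of week-based year 2020.
    val d = LocalDate.of(2019, 12, 30)
    println(format(d, "yyyy")) // calendar year: 2019
    println(format(d, "YYYY")) // week-based year: 2020
  }
}
```

A year partition derived from the week-based year would place such a row under the wrong year directory, which is why yyyy is the safer choice for partition columns.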