Alb*_*bus 1 java json apache-spark spark-streaming
我在 Java 中有以下字符串
{
"header": {
"gtfs_realtime_version": "1.0",
"incrementality": 0,
"timestamp": 1528460625,
"user-data": "metra"
},
"entity": [{
"id": "8424",
"vehicle": {
"trip": {
"trip_id": "UP-N_UN314_V1_D",
"route_id": "UP-N",
"start_time": "06:17:00",
"start_date": "20180608",
"schedule_relationship": 0
},
"vehicle": {
"id": "8424",
"label": "314"
},
"position": {
"latitude": 42.10085,
"longitude": -87.72896
},
"current_status": 2,
"timestamp": 1528460601
}
}
]
}
Run Code Online (Sandbox Code Playgroud)
表示 JSON 文档。我想为流应用程序推断Spark Dataframe 中的模式。
如何将 String 的字段拆分为类似于 CSV 文档(我可以在其中调用.split(""))?
引用官方文档Schema inference and partition of streaming DataFrames/Datasets:
默认情况下,来自基于文件的源的结构化流需要您指定架构,而不是依赖 Spark 自动推断它。此限制确保一致的架构将用于流式查询,即使在失败的情况下也是如此。对于临时用例,您可以通过设置
spark.sql.streaming.schemaInference为 true来重新启用模式推断。
然后,您可以使用spark.sql.streaming.schemaInference配置属性来启用架构推断。我不确定这是否适用于 JSON 文件。
我通常做的是加载单个文件(在批处理查询中和开始流式查询之前)来推断模式。这应该适用于您的情况。只需执行以下操作。
// I'm leaving converting Scala to Java as a home exercise
val jsonSchema = spark
.read
.option("multiLine", true) // <-- the trick
.json("sample.json")
.schema
scala> jsonSchema.printTreeString
root
|-- entity: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- vehicle: struct (nullable = true)
| | | |-- current_status: long (nullable = true)
| | | |-- position: struct (nullable = true)
| | | | |-- latitude: double (nullable = true)
| | | | |-- longitude: double (nullable = true)
| | | |-- timestamp: long (nullable = true)
| | | |-- trip: struct (nullable = true)
| | | | |-- route_id: string (nullable = true)
| | | | |-- schedule_relationship: long (nullable = true)
| | | | |-- start_date: string (nullable = true)
| | | | |-- start_time: string (nullable = true)
| | | | |-- trip_id: string (nullable = true)
| | | |-- vehicle: struct (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- label: string (nullable = true)
|-- header: struct (nullable = true)
| |-- gtfs_realtime_version: string (nullable = true)
| |-- incrementality: long (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- user-data: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
诀窍是使用multiLine选项,因此整个文件是您用来推断架构的单行。
| 归档时间: |
|
| 查看次数: |
9802 次 |
| 最近记录: |