Har*_*Koo 4 json scala apache-spark apache-spark-sql
In Spark SQL, I can use
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("SparkSessionZipsExample")
  .master("local")
  .config("spark.sql.warehouse.dir", "warehouseLocation-value")
  .getOrCreate()
val df = spark.read.json("source/myRecords.json")
df.createOrReplaceTempView("shipment")
val sqlDF = spark.sql("SELECT * FROM shipment")
to load the data from "myRecords.json". The structure of this JSON file is:
df.printSchema()
root
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- container: struct (nullable = true)
| |-- barcode: string (nullable = true)
| |-- code: string (nullable = true)
I can select specific columns from this JSON, for example:
val sqlDF = spark.sql("SELECT container.barcode, container.code FROM shipment")
But how do I get _id.$oid from this JSON file? I tried "SELECT id.$oid FROM shipment_log" and "SELECT id.\$oid FROM shipment_log", but neither works at all. The error message is:
error: invalid escape character
Can anyone tell me how to get _id.$oid?
Backticks are your friend:
spark.read.json(sc.parallelize(Seq(
"""{"_id": {"$oid": "foo"}}""")
)).createOrReplaceTempView("df")
spark.sql("SELECT _id.`$oid` FROM df").show
+----+
|$oid|
+----+
| foo|
+----+
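The "\$" attempt in the question fails before Spark ever sees the query: \$ is not a recognized escape sequence in an ordinary Scala string literal, so the compiler reports the invalid escape character error. When field names come from data rather than being typed by hand, the backtick quoting can be automated. A minimal sketch in plain Scala (no Spark required), assuming that a backtick inside a quoted identifier is written as two backticks; quoteSegment is a hypothetical helper, not part of Spark:

```scala
object BacktickQuoting {
  // Hypothetical helper: wrap one path segment in backticks,
  // doubling any backtick that is already part of the name.
  def quoteSegment(name: String): String =
    "`" + name.replace("`", "``") + "`"

  // Quote every segment of the nested path, then join with dots.
  val column: String = Seq("_id", "$oid").map(quoteSegment).mkString(".")

  // $column is interpolated here; the $ inside its value is left untouched.
  val query: String = s"SELECT $column FROM shipment"

  def main(args: Array[String]): Unit =
    println(query) // SELECT `_id`.`$oid` FROM shipment
}
```

The resulting string can then be passed to spark.sql. Quoting _id as well is harmless and keeps the helper uniform.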
The same works with the DataFrame API:
spark.table("df").select($"_id".getItem("$oid")).show
+--------+
|_id.$oid|
+--------+
|     foo|
+--------+
or
spark.table("df").select($"_id.$$oid").show
+--------+
|_id.$oid|
+--------+
|     foo|
+--------+
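The doubled dollar sign is needed only because $"..." is a string interpolator, where a single $ introduces a substitution and $$ stands for one literal $. A small plain-Scala sketch of the same rule, using the standard s interpolator in place of Spark's $ (the object and value names are illustrative only):

```scala
object DollarEscape {
  // Ordinary string literal: $ has no special meaning here.
  val plain: String = "_id.$oid"

  // Interpolated string: $$ collapses to a single literal $.
  val interpolated: String = s"_id.$$oid"

  def main(args: Array[String]): Unit =
    println(plain == interpolated) // prints true
}
```

For the same reason, passing an ordinary string to org.apache.spark.sql.functions.col, as in col("_id.$oid"), should need no escaping at all.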