小编ngc*_*359的帖子

JSON格式的Spark DataFrame列上的隐式模式发现

我正在Scala中编写ETL Spark（2.4）作业，该读取;在S3上具有glob模式的分隔CSV文件。数据已加载到DataFrame中，并包含一列（即名为custom）和JSON格式的字符串（多层嵌套）。目标是从该列自动推断模式，以便可以针对S3中的Parquet文件上的写接收器进行结构化。

这篇文章（如何使用Spark DataFrames查询JSON数据列？）建议schema_of_jsonSpark 2.4可以从JSON格式的列或字符串推断模式。

这是我尝试过的：

val jsonSchema: String = df.select(schema_of_json(col("custom"))).as[String].first

df.withColumn(
    "nestedCustom",
    from_json(col("custom"), jsonSchema, Map[String, String]())
)

Run Code Online (Sandbox Code Playgroud)

但是上述方法不起作用并引发此异常：

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'schemaofjson(`custom`)' due to data type mismatch: The input json should be a string literal and not null; however, got `custom`.;;
'Project [schemaofjson(custom#7) AS schemaofjson(custom)#16]

Run Code Online (Sandbox Code Playgroud)

请记住，我正在custom为此DataFrame 筛选出空值。

编辑：下面的整个代码。

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'schemaofjson(`custom`)' due to data type mismatch: The …

Run Code Online (Sandbox Code Playgroud)

scala apache-spark

ngc*_*359

2019 02-14

5
推荐指数

1
解决办法

1546
查看次数

标签统计

apache-spark ×1

scala ×1

JSON格式的Spark DataFrame列上的隐式模式发现

标签 统计

小编ngc_359的帖子

标签统计