Hav*_*nar 4 scala apache-spark
I want to union two DataFrames with (possibly) mismatched schemas:
org.apache.spark.sql.DataFrame = [name: string, age: int, height: int]
org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> A.unionAll(B)
This results in:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the left table has 2 columns and the right has 3;
I'd like to do this from within Spark. However, the Spark documentation only suggests writing both DataFrames out to a directory and reading them back with spark.read.option("mergeSchema", "true").

So the union doesn't help me, and neither does the documentation. I'd like to keep this extra I/O out of my job if at all possible. Am I missing some undocumented information, or is this not possible (yet)?
Parquet schema merging is disabled by default; turn this option on by either:
(1) set global option: spark.sql.parquet.mergeSchema=true
(2) write code: sqlContext.read.option("mergeSchema", "true").parquet("my.parquet")
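Note that mergeSchema only applies when the data passes through Parquet, so this still implies the write/read round trip the question wants to avoid. A minimal sketch of that approach, assuming A and B are the DataFrames from the question and "/tmp/merged" is a placeholder scratch path:

```scala
// Write both frames as Parquet under one root, then read them back
// with schema merging enabled. Path and subdirectory names are placeholders.
A.write.parquet("/tmp/merged/part=a")
B.write.parquet("/tmp/merged/part=b")

val merged = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/merged")
// merged has the union of both schemas (name, age, height, plus the
// "part" partition column); rows that came from B have null for height.
```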
You can append the missing columns to frame B as nulls and then union the two frames:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame

// Columns present in A but missing from B
val missingFields = A.schema.toSet.diff(B.schema.toSet)

var C: DataFrame = B
for (field <- missingFields) {
  // Add each missing column as a typed null so the schemas line up
  C = C.withColumn(field.name, lit(null).cast(field.dataType))
}

// unionAll matches columns by position, so select A's column order first
A.unionAll(C.select(A.columns.map(col): _*))
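The mutable var and loop can also be written as a foldLeft; this is the same idea restated, assuming the same A and B frames:

```scala
import org.apache.spark.sql.functions._

// Fold the missing fields into B, adding each as a typed null column
val aligned = A.schema.toSet.diff(B.schema.toSet).foldLeft(B) {
  (df, field) => df.withColumn(field.name, lit(null).cast(field.dataType))
}

// Reorder to A's column order before the positional union
A.unionAll(aligned.select(A.columns.map(col): _*))
```

On newer Spark versions this workaround is unnecessary: unionAll was superseded by union in Spark 2.0, and Spark 3.1+ offers A.unionByName(B, allowMissingColumns = true), which matches columns by name and fills the missing ones with null directly.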