Union can only be performed on tables with compatible column types (Spark DataFrame)

SUD*_*HAN 2 union scala dataframe apache-spark apache-spark-sql

Here is my union code:

val dfToSave=dfMainOutput.union(insertdf.select(dfMainOutput).withColumn("FFAction", when($"FFAction" === "O" || $"FFAction" === "I", lit("I|!|")))

When I perform the union, I get the following error:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. string <> boolean at the 11th column of the second table;;
'Union

Here are the schemas of the two DataFrames:

insertdf.printSchema()
root
 |-- OrganizationID: long (nullable = true)
 |-- SourceID: integer (nullable = true)
 |-- AuditorID: integer (nullable = true)
 |-- AuditorOpinionCode: string (nullable = true)
 |-- AuditorOpinionOnInternalControlCode: string (nullable = true)
 |-- AuditorOpinionOnGoingConcernCode: string (nullable = true)
 |-- IsPlayingAuditorRole: boolean (nullable = true)
 |-- IsPlayingTaxAdvisorRole: boolean (nullable = true)
 |-- AuditorEnumerationId: integer (nullable = true)
 |-- AuditorOpinionId: integer (nullable = true)
 |-- AuditorOpinionOnInternalControlsId: string (nullable = true)
 |-- AuditorOpinionOnGoingConcernId: string (nullable = true)
 |-- IsPlayingCSRAuditorRole: boolean (nullable = true)
 |-- FFAction: string (nullable = true)
 |-- DataPartition: string (nullable = true)

Here is the schema of the second DataFrame:

dfMainOutput.printSchema()
root
 |-- OrganizationID: long (nullable = true)
 |-- SourceID: integer (nullable = true)
 |-- AuditorID: integer (nullable = true)
 |-- AuditorOpinionCode: string (nullable = true)
 |-- AuditorOpinionOnInternalControlCode: string (nullable = true)
 |-- AuditorOpinionOnGoingConcernCode: string (nullable = true)
 |-- IsPlayingAuditorRole: boolean (nullable = true)
 |-- IsPlayingTaxAdvisorRole: boolean (nullable = true)
 |-- AuditorEnumerationId: integer (nullable = true)
 |-- AuditorOpinionId: integer (nullable = true)
 |-- AuditorOpinionOnInternalControlsId: integer (nullable = true)
 |-- AuditorOpinionOnGoingConcernId: boolean (nullable = true)
 |-- IsPlayingCSRAuditorRole: string (nullable = true)
 |-- FFAction: string (nullable = true)
 |-- DataPartition: string (nullable = true)

To avoid this problem I might have to write a select for every column. Is there any Scala syntax to type cast the columns, or to set both DataFrames to the same types?

Here is what I have tried so far, but I still get the same error:

val columns = dfMainOutput.columns.toSet.intersect(insertdf.columns.toSet).map(col).toSeq

//Perform Union
val dfToSave=dfMainOutput.select(columns: _*).union(insertdf.select(columns: _*)).withColumn("FFAction", when($"FFAction" === "O" || $"FFAction" === "I", lit("I|!|")))

Sha*_*ica 5

To perform a union of DataFrames, the data type of each column must match. Looking at your schemas, three columns do not satisfy this:

AuditorOpinionOnInternalControlsId
AuditorOpinionOnGoingConcernId
IsPlayingCSRAuditorRole
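The mismatched columns can also be found programmatically instead of by reading the printSchema output. Since union pairs columns by position, the sketch below compares the two schemas position by position. It assumes only that the Spark SQL types package is on the classpath; `mismatchedColumns` is a hypothetical helper name, and the two schemas are cut-down copies of the ones in the question:

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: list the columns whose data types differ between two
// schemas, comparing them position by position (the way union pairs columns).
def mismatchedColumns(left: StructType, right: StructType): Seq[(String, DataType, DataType)] =
  left.fields.zip(right.fields).toSeq.collect {
    case (l, r) if l.dataType != r.dataType => (l.name, l.dataType, r.dataType)
  }

// Cut-down copy of the dfMainOutput schema from the question.
val mainSchema = StructType(Seq(
  StructField("AuditorOpinionId", IntegerType),
  StructField("AuditorOpinionOnInternalControlsId", IntegerType),
  StructField("AuditorOpinionOnGoingConcernId", BooleanType),
  StructField("IsPlayingCSRAuditorRole", StringType)))

// Cut-down copy of the insertdf schema from the question.
val insertSchema = StructType(Seq(
  StructField("AuditorOpinionId", IntegerType),
  StructField("AuditorOpinionOnInternalControlsId", StringType),
  StructField("AuditorOpinionOnGoingConcernId", StringType),
  StructField("IsPlayingCSRAuditorRole", BooleanType)))

val diffs = mismatchedColumns(mainSchema, insertSchema)
diffs.foreach { case (name, l, r) => println(s"$name: $l <> $r") }
```

With the real DataFrames, the call would be `mismatchedColumns(dfMainOutput.schema, insertdf.schema)`.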

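A simple way to change the data type is to use withColumn together with cast. In the code below I assume that the correct types are the ones in the dfMainOutput DataFrame: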

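import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType}

// Cast the three mismatched columns to the types used in dfMainOutput
val insertDfNew = insertdf
  .withColumn("AuditorOpinionOnInternalControlsId", $"AuditorOpinionOnInternalControlsId".cast(IntegerType))
  .withColumn("AuditorOpinionOnGoingConcernId", $"AuditorOpinionOnGoingConcernId".cast(BooleanType))
  .withColumn("IsPlayingCSRAuditorRole", $"IsPlayingCSRAuditorRole".cast(StringType))
  // Map "O" and "I" to "I|!|"; keep every other FFAction value unchanged
  .withColumn("FFAction", when($"FFAction" === "O" || $"FFAction" === "I", lit("I|!|")).otherwise($"FFAction"))

val dfToSave = dfMainOutput.union(insertDfNew)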
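To avoid writing one cast per column, the casts can be derived from the target schema. The following is a minimal sketch, not part of the original answer: `alignTo` is a hypothetical helper name, the local SparkSession exists only for the demo, and the rows are made up to reproduce the same type mismatches as in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper: select the target's columns from `df`, in the target's
// order, casting each one to the target's data type and keeping its name.
def alignTo(df: DataFrame, target: DataFrame): DataFrame =
  df.select(target.schema.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

// Local session for the demo only (assumption: the real job has its own session).
val spark = SparkSession.builder().master("local[1]").appName("align-demo").getOrCreate()
import spark.implicits._

val names = Seq("AuditorOpinionOnInternalControlsId", "AuditorOpinionOnGoingConcernId",
  "IsPlayingCSRAuditorRole", "FFAction")

// Made-up rows with the same type mismatches as in the question.
val mainDf = Seq((1, true, "Y", "O")).toDF(names: _*)
val insertDf = Seq(("2", "false", true, "I")).toDF(names: _*)

// After alignment both sides have identical column types, so union succeeds.
val unioned = mainDf.union(alignTo(insertDf, mainDf))
unioned.printSchema()
```

With the DataFrames from the question this would read `dfMainOutput.union(alignTo(insertdf, dfMainOutput))`. Note that a cast does not validate the data: depending on the Spark version and its ANSI settings, a value that cannot be converted (for example a non-numeric string cast to integer) may become null or raise an error, so it is worth checking the result.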