How to cast an array of structs in a Spark DataFrame using selectExpr?

mah*_*hdi 4 sql scala dataframe apache-spark apache-spark-sql

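How do I cast an array of structs in a Spark DataFrame?

Let me explain what I want to do with an example. We start by creating a DataFrame that contains an array of rows and a nested row. My integers have not been cast yet in the DataFrame; they are created as strings: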

import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rows1 = Seq(
  Row("1", Row("a", "b"), "8.00", Seq(Row("1","2"), Row("12","22"))),
  Row("2", Row("c", "d"), "9.00", Seq(Row("3","4"), Row("33","44")))
)

val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)

val schema1 = StructType(
  Seq(
    StructField("id", StringType, true),
    StructField("s1", StructType(
      Seq(
        StructField("x", StringType, true),
        StructField("y", StringType, true)
      )
    ), true),
    StructField("d", StringType, true),
    StructField("s2", ArrayType(StructType(
      Seq(
        StructField("u", StringType, true),
        StructField("v", StringType, true)
      )
    )), true)
  )
)

val df1 = spark.createDataFrame(rows1Rdd, schema1)

This is the schema of the created DataFrame:

df1.printSchema
root
 |-- id: string (nullable = true)
 |-- s1: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: string (nullable = true)
 |-- d: string (nullable = true)
 |-- s2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- u: string (nullable = true)
 |    |    |-- v: string (nullable = true)

What I want to do is cast every string that can be an integer to an integer. I tried the following, but it did not work:

df1.selectExpr("CAST (id AS INTEGER) as id",
  "STRUCT (s1.x, s1.y) AS s1",
  "CAST (d AS DECIMAL) as d",
  "Array (Struct(CAST (s2.u AS INTEGER), CAST (s2.v AS INTEGER))) as s2").show()

I get the following exception:

cannot resolve 'CAST(`s2`.`u` AS INT)' due to data type mismatch: cannot cast array<string> to int; line 1 pos 14;

Does anyone have the correct query to cast all of these values to INTEGER? I would appreciate it.

Thanks a lot,

zer*_*323 5

You should match the complete structure, that is, cast the whole s2 column to the full target type:

val result = df1.selectExpr(
  "CAST(id AS integer) id",
  "s1",
  "CAST(d AS decimal) d",
  "CAST(s2 AS array<struct<u:integer,v:integer>>) s2"
)

This should give you the following schema:

result.printSchema
root
 |-- id: integer (nullable = true)
 |-- s1: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: string (nullable = true)
 |-- d: decimal(10,0) (nullable = true)
 |-- s2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- u: integer (nullable = true)
 |    |    |-- v: integer (nullable = true)

And the data:

result.show
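
Some context on the error in the question may help: on an array of structs, s2.u extracts u from every element, so its type is array<string>, which cannot be cast to a single int. The following is a minimal sketch of that behaviour and of two ways to cast the elements. It assumes a local SparkSession, and the transform variant assumes Spark 2.4 or later. The names df, extracted, casted and viaTransform are illustrative and do not come from the original post:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Assumption: a local session for experimentation; in spark-shell this
// simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("cast-array-of-structs")
  .getOrCreate()

// A reduced version of the question's schema: only the array-of-structs column.
val schema = StructType(Seq(
  StructField("s2", ArrayType(StructType(Seq(
    StructField("u", StringType, true),
    StructField("v", StringType, true)
  ))), true)
))
val rows = Seq(Row(Seq(Row("1", "2"), Row("12", "22"))))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Selecting a struct field through an array yields an array of that field,
// here array<string>. This is why CAST(s2.u AS INTEGER) fails.
val extracted = df.select(col("s2.u").as("u"))

// Column API equivalent of the selectExpr cast: cast the whole column
// to the full target type.
val casted = df.withColumn("s2", col("s2").cast("array<struct<u:int,v:int>>"))

// Alternative (Spark 2.4+): rebuild each element with the transform
// higher-order function, casting field by field.
val viaTransform = df.selectExpr(
  "transform(s2, e -> named_struct('u', CAST(e.u AS int), 'v', CAST(e.v AS int))) AS s2"
)

// Read the (u, v) pairs back from the first row of a result.
def pairs(r: Row): Seq[(Int, Int)] =
  r.getSeq[Row](0).map(e => (e.getInt(0), e.getInt(1)))
```

Calling pairs(casted.first()) or pairs(viaTransform.first()) returns the integer pairs for the single input row. Note also that CAST(d AS decimal) without an explicit precision and scale defaults to decimal(10,0), so a value such as "8.00" loses its fractional digits; a type such as decimal(10,2) keeps them.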