How do I preserve types when converting a Scala Spark DataFrame to an RDD?

tSc*_*ema 2 scala apache-spark

I am trying to convert a DataFrame to an RDD. My DataFrame has typed columns, as shown below:

df.printSchema
root
 |-- _c0: integer (nullable = true)
 |-- num_hits: integer (nullable = true)
 |-- session_name: string (nullable = true)
 |-- user_id: string (nullable = true)

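When I convert it to an RDD using df.rdd, I get an rdd of type Array[org.apache.spark.sql.Row]. But when I access each entry using rdd(0)(0), rdd(0)(1), etc., I find that they all have type Any. How can I keep the same types the DataFrame has when I convert it to an RDD? In other words: how can I get the columns in my rdd to have the types Int, Int, String, String, so that they match the DataFrame?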

zsx*_*ing 5

You can convert your DataFrame to a Dataset[(Int, Int, String, String)] first, e.g.

scala> val df = Seq((1, 2, "a", "b")).toDF("_c0", "num_hits", "session_name", "user_id")
df: org.apache.spark.sql.DataFrame = [_c0: int, num_hits: int ... 2 more fields]

scala> df.printSchema
root
 |-- _c0: integer (nullable = false)
 |-- num_hits: integer (nullable = false)
 |-- session_name: string (nullable = true)
 |-- user_id: string (nullable = true)


scala> val rdd = df.as[(Int, Int, String, String)].rdd
rdd: org.apache.spark.rdd.RDD[(Int, Int, String, String)] = MapPartitionsRDD[3] at rdd at <console>:25

If _c0 and num_hits can be null, just change Int to java.lang.Integer.
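To make the nullable case concrete, here is a minimal sketch, assuming a local SparkSession and a small made-up two-row dataset. The object name NullableColumnsSketch and the helper convert are hypothetical and exist only for illustration. It shows the java.lang.Integer variant suggested above next to an Option[Int] variant, which Spark's built-in encoders also accept as far as I know.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical names throughout; this is an illustration, not part of the original answer.
object NullableColumnsSketch {
  type Boxed = (java.lang.Integer, java.lang.Integer, String, String)
  type Optional = (Option[Int], Option[Int], String, String)

  def convert(): (Array[Boxed], Array[Optional]) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nullable-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Option[Int] in the source tuples produces nullable integer columns;
    // the second row has a NULL in _c0.
    val df = Seq(
      (Option(1), Option(2), "a", "b"),
      (Option.empty[Int], Option(3), "c", "d")
    ).toDF("_c0", "num_hits", "session_name", "user_id")

    // Boxed integers: a SQL NULL surfaces as a Java null reference.
    val boxed = df.as[Boxed].rdd.collect()

    // Option: a SQL NULL surfaces as None, which is usually safer in Scala code.
    val optional = df.as[Optional].rdd.collect()

    spark.stop()
    (boxed, optional)
  }

  def main(args: Array[String]): Unit = {
    val (boxed, optional) = convert()
    boxed.foreach(println)
    optional.foreach(println)
  }
}
```

With plain Int, decoding a row whose column is NULL is expected to fail at runtime, so the boxed or Option form is the one to reach for whenever printSchema reports nullable = true and the data may really contain nulls.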