How to automate StructType creation for passing an RDD to a DataFrame

duc*_*ito 6 scala apache-spark rdd spark-dataframe

I want to save an RDD as a Parquet file. To do so, I pass the RDD to a DataFrame together with a struct, and then save the DataFrame as a Parquet file:

    val aStruct = new StructType(Array(StructField("id",StringType,nullable = true),
                                       StructField("role",StringType,nullable = true)))
    val newDF = sqlContext.createDataFrame(filtered, aStruct)

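The question is how to create aStruct automatically for all columns, assuming that all of them are StringType. Also, what does nullable = true mean? Does it mean that all empty values will be replaced by null?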

eva*_*man 4

Why not use the built-in toDF?

    scala> val myRDD = sc.parallelize(Seq(("1", "roleA"), ("2", "roleB"), ("3", "roleC")))
    myRDD: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[60] at parallelize at <console>:27

    scala> val colNames = List("id", "role")
    colNames: List[String] = List(id, role)

    scala> val myDF = myRDD.toDF(colNames: _*)
    myDF: org.apache.spark.sql.DataFrame = [id: string, role: string]

    scala> myDF.show
    +---+-----+
    | id| role|
    +---+-----+
    |  1|roleA|
    |  2|roleB|
    |  3|roleC|
    +---+-----+

    scala> myDF.printSchema
    root
     |-- id: string (nullable = true)
     |-- role: string (nullable = true)

    scala> myDF.write.save("myDF.parquet")
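If an explicit schema is still preferred, as in the question, the StructType can be generated from a list of column names instead of being written out by hand. The following is only a minimal sketch, assuming every column is a StringType and that the RDD already holds Row objects; SchemaSketch, allStringSchema and toStringDF are hypothetical names introduced for illustration, and rows stands in for filtered from the question.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object SchemaSketch {
  // Map each column name to a nullable string field and wrap the result in a StructType.
  def allStringSchema(colNames: Seq[String]): StructType =
    StructType(colNames.map(name => StructField(name, StringType, nullable = true)))

  // Hypothetical helper: pairs an RDD[Row] with the generated schema.
  // Assumes each Row has exactly one string value per entry in colNames.
  def toStringDF(sqlContext: SQLContext, rows: RDD[Row], colNames: Seq[String]): DataFrame =
    sqlContext.createDataFrame(rows, allStringSchema(colNames))
}
```

With this in place, something like SchemaSketch.toStringDF(sqlContext, filtered, Seq("id", "role")) should play the role of the hand-written aStruct, provided filtered is an RDD[Row].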

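Simply put, nullable = true means that the specified column can contain null values. This is especially useful for int columns, which would normally not have a null value (Int has no NA or null).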
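To make the point about nullable concrete, here is a small sketch showing that the flag is just metadata attached to each field of the schema. The count column and the NullableSketch object are assumptions added for illustration; they are not part of the original question.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object NullableSketch {
  // Hypothetical two-column schema; "count" exists only for this illustration.
  val schema: StructType = StructType(Seq(
    StructField("id", StringType, nullable = true),      // column may hold null
    StructField("count", IntegerType, nullable = false)  // column is declared to never hold null
  ))

  // The flag can be read back per field; here we collect the names of the nullable columns.
  def nullableColumns: List[String] =
    schema.fields.filter(_.nullable).map(_.name).toList
}
```

As far as I can tell, the flag only describes what a column is allowed to hold. It does not rewrite any values, so an empty string stays an empty string rather than becoming null.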