How to create a sample DataFrame in Scala / Spark

Question

I am trying to create a simple DataFrame as follows:

import sqlContext.implicits._

val lookup = Array("one", "two", "three", "four", "five")

val theRow = Array("1",Array(1,2,3), Array(0.1,0.4,0.5))

val theRdd = sc.makeRDD(theRow)

case class X(id: String, indices: Array[Integer], weights: Array[Float] )

val df = theRdd.map{
    case Array(s0,s1,s2) =>    X(s0.asInstanceOf[String],s1.asInstanceOf[Array[Integer]],s2.asInstanceOf[Array[Float]])
}.toDF()

df.show()

df is defined as

df: org.apache.spark.sql.DataFrame = [id: string, indices: array<int>, weights: array<float>]

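which is what I want.

On execution, I get

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 13.0 failed 1 times, most recent failure: Lost task 1.0 in stage 13.0 (TID 50, localhost): scala.MatchError: 1 (of class java.lang.String)

Where does this MatchError come from? And is there a simpler way to create sample DataFrames programmatically?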

Answer 1

Vij*_*ian 5

You can refer to another example:

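// `spark` is the SparkSession predefined by spark-shell in Spark 2.x+; toDF needs its implicits.
// On Spark 1.x, import sqlContext.implicits._ after creating sqlContext instead.
import spark.implicits._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)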

val columns=Array("id", "first", "last", "year")
val df1=sc.parallelize(Seq(
  (1, "John", "Doe", 1986),
  (2, "Ive", "Fish", 1990),
  (4, "John", "Wayne", 1995)
)).toDF(columns: _*)

val df2=sc.parallelize(Seq(
  (1, "John", "Doe", 1986),
  (2, "IveNew", "Fish", 1990),
  (3, "San", "Simon", 1974)
)).toDF(columns: _*)
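
For the schema in the question specifically, a similar approach might look like the sketch below. It assumes Spark 2.x or later (where `SparkSession` and `spark.implicits._` are available) and a local master; the app name and the single sample row are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ is on the classpath. In spark-shell, `spark` already
// exists and this builder call can be skipped.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sample-dataframe") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// A local Seq of tuples is converted directly; no RDD or case class is needed.
val df = Seq(
  ("1", Array(1, 2, 3), Array(0.1f, 0.4f, 0.5f))
).toDF("id", "indices", "weights")

df.printSchema()
df.show()
```

Using `Float` literals (`0.1f`) should give `weights: array<float>` as in the question; plain `0.1` literals would give `array<double>` instead.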

Answer 2

Chr*_*sch 4

First, theRow should be a Row rather than an Array. Then, if you change the types in a way that respects the compatibility between Java and Scala, your example will work:

import org.apache.spark.sql.Row

val theRow = Row("1", Array[java.lang.Integer](1,2,3), Array[Double](0.1,0.4,0.5))
val theRdd = sc.makeRDD(Array(theRow))
case class X(id: String, indices: Array[Integer], weights: Array[Double] )
val df=theRdd.map{
    case Row(s0,s1,s2)=>X(s0.asInstanceOf[String],s1.asInstanceOf[Array[Integer]],s2.asInstanceOf[Array[Double]])
  }.toDF()
df.show()

//+---+---------+---------------+
//| id|  indices|        weights|
//+---+---------+---------------+
//|  1|[1, 2, 3]|[0.1, 0.4, 0.5]|
//+---+---------+---------------+

Note that you need "import sqlContext.implicits._" in order to use "toDF". (2 upvotes)
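
As for where the MatchError in the question comes from: `sc.makeRDD(theRow)` treats the three elements of `theRow` as three separate records, so the `case Array(s0,s1,s2)` pattern is applied to each element on its own, and the string "1" does not match it. The sketch below reproduces this with plain Scala collections standing in for the RDD (an assumption made so that it runs without Spark, for example in a Scala 2 REPL); `flatAttempt`, `wrapped`, and `ids` are hypothetical names.

```scala
import scala.util.Try

// Same shape as theRow in the question, with the element type made explicit.
val theRow: Array[Any] = Array("1", Array(1, 2, 3), Array(0.1, 0.4, 0.5))

// Stand-in for sc.makeRDD(theRow).map { ... }: each of the three elements is
// matched on its own, and the String "1" is not an Array, so this throws.
val flatAttempt = Try(theRow.map { case Array(s0, _, _) => s0 })
println(flatAttempt.isFailure)

// Wrapping the row in an outer collection makes the whole row a single record.
val wrapped: Array[Array[Any]] = Array(theRow)
val ids = wrapped.map { case Array(s0, _, _) => s0.asInstanceOf[String] }
println(ids.mkString(","))
```

Wrapping fixes the MatchError only. The casts in the question would still fail at runtime, because `Array(1,2,3)` is a primitive `int` array rather than an `Array[Integer]`, and `Array(0.1,0.4,0.5)` holds `Double` values rather than `Float` values, which is why the answer above changes the element types.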
