Spark with recursive case classes

Question

ole*_*ber 5 scala apache-spark apache-spark-sql apache-spark-dataset

I have a recursive data structure. Spark gives this error:

Exception in thread "main" java.lang.UnsupportedOperationException: cannot have circular references in class, but got the circular reference of class BulletPoint

As an example, I wrote the following code:

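import org.apache.spark.sql.SparkSession

case class BulletPoint(item: String, children: List[BulletPoint])

object TestApp extends App {
  val sparkSession = SparkSession
    .builder()
    .appName("spark app")
    .master("local")
    .getOrCreate()

  import sparkSession.implicits._

  // Throws UnsupportedOperationException: the implicit product encoder
  // cannot derive a schema for the self-referencing `children` field.
  sparkSession.createDataset(List(BulletPoint("1", Nil), BulletPoint("2", Nil)))
}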

Does anyone know how to work around this?

Answer 1

104*_*ica 3

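The exception is fairly explicit: this case is not supported by default. Keep in mind that Datasets are encoded into a relational schema, so all required fields must be declared up front and bounded. There is no room for recursive structures here.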

There is a small window here, though: binary Encoders:

import org.apache.spark.sql.{Encoder, Encoders}

sparkSession.createDataset(List(
  BulletPoint("1", Nil), BulletPoint("2", Nil)
))(Encoders.kryo[BulletPoint])

or equivalently:

implicit val bulletPointEncoder: Encoder[BulletPoint] = Encoders.kryo[BulletPoint]

sparkSession.createDataset(List(
  BulletPoint("1", Nil), BulletPoint("2", Nil)
))
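
To see what the binary encoder actually does to the data, it can help to look at the schema it produces. The following is a minimal sketch, not part of the original answer; the object name KryoSchemaCheck is made up for illustration. With a Kryo encoder the schema should consist of a single binary column named value, so neither item nor children is visible to Spark SQL as a column.

```scala
import org.apache.spark.sql.Encoders

case class BulletPoint(item: String, children: List[BulletPoint])

// Hypothetical helper object, used only for this illustration.
object KryoSchemaCheck {
  // Reading an encoder's schema does not require a running SparkSession.
  val schema = Encoders.kryo[BulletPoint].schema

  def main(args: Array[String]): Unit = {
    // Prints the schema tree: one binary column instead of
    // separate columns for `item` and `children`.
    schema.printTreeString()
  }
}
```

Because every object is stored as one serialized blob, column-level operations such as select or filter on item are not available without first deserializing each row.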

But unless it is strictly necessary, you really don't want this in your code.
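
If the goal is to keep the data queryable, one possible alternative (an assumption on my part, not something the answer above proposes) is to flatten the tree into rows that reference their parent by id, which fits the "declared up front and bounded" requirement. The names BulletRow, Flatten, id and parentId below are hypothetical and chosen only for this sketch.

```scala
// Recursive shape from the question.
case class BulletPoint(item: String, children: List[BulletPoint])

// Hypothetical flat row: each node refers to its parent by id,
// so the case class no longer references itself.
case class BulletRow(id: Long, parentId: Option[Long], item: String)

object Flatten {
  // Depth-first walk that assigns ids in visiting order.
  // Note: plain recursion, so extremely deep trees could overflow the stack.
  def flatten(roots: List[BulletPoint]): List[BulletRow] = {
    var nextId = 0L
    val rows = List.newBuilder[BulletRow]

    def visit(node: BulletPoint, parent: Option[Long]): Unit = {
      val id = nextId
      nextId += 1
      rows += BulletRow(id, parent, node.item)
      node.children.foreach(child => visit(child, Some(id)))
    }

    roots.foreach(root => visit(root, None))
    rows.result()
  }

  def main(args: Array[String]): Unit = {
    val tree = List(
      BulletPoint("1", List(BulletPoint("1.1", Nil))),
      BulletPoint("2", Nil)
    )
    flatten(tree).foreach(println)
  }
}
```

Since BulletRow contains only Long, Option[Long] and String fields, the implicit encoders from sparkSession.implicits._ should be able to handle a List[BulletRow] passed to createDataset, and the hierarchy can be rebuilt later by joining on parentId.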
