小编Eri*_*ero的帖子

如何从包含Enums的案例类创建Spark Dataset或Dataframe

我一直在尝试使用包含枚举的案例类创建Spark数据集,但我无法做到.我正在使用Spark版本1.6.0.例外是抱怨我的枚举没有找到编码器.这在Spark中不可能在数据中包含枚举吗?

码:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object MyEnum extends Enumeration {
  type MyEnum = Value
  val Hello, World = Value
}

case class MyData(field: String, other: MyEnum.Value)

object EnumTest {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val sqlCtx = new SQLContext(sc)

    import sqlCtx.implicits._

    val df = sc.parallelize(Array(MyData("hello", MyEnum.World))).toDS()

    println(s"df: ${df.collect().mkString(",")}}")
  }

}
Run Code Online (Sandbox Code Playgroud)

错误:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for com.company.MyEnum.Value
- field (class: "scala.Enumeration.Value", name: "other") …
Run Code Online (Sandbox Code Playgroud)

scala apache-spark apache-spark-sql

9
推荐指数
1
解决办法
5515
查看次数

由 UTFDataFormatException 引起的任务在 Spark 中不可序列化:编码的​​字符串太长

我在 Yarn 上运行 Spark 应用程序时遇到了一些问题。我有非常广泛的集成测试,运行时没有任何问题,但是当我在 YARN 上运行应用程序时,它会抛出以下错误:

17/01/06 11:22:23 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Task not serializable
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2067)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.map(RDD.scala:323)
    at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1410)
    at com.orgx.yy.dd.check.DQCheck$class.runDQCheck(DQCheck.scala:24)
    at com.orgx.yy.dd.check.DQBatchCheck.runDQCheck(DQBatchCheck.scala:13)
    at com.orgx.yy.dd.check.DQBatchCheck.doCheck(DQBatchCheck.scala:23)
    at com.orgx.yy.dd.DQChecker$.main(DQChecker.scala:60)
    at com.orgx.yy.dd.DQChecker.main(DQChecker.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
Caused by: java.io.UTFDataFormatException: encoded string too long: 72887 bytes
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
    at …
Run Code Online (Sandbox Code Playgroud)

scala hadoop-yarn apache-spark

1
推荐指数
1
解决办法
2224
查看次数