Scala member field visibility in a Spark job

Ken*_*ams 2 visibility scala initialization apache-spark

I have a Scala class defined like this:

import org.apache.spark.{SparkConf, SparkContext}

object TestObject extends App{
  val FAMILY = "data".toUpperCase

  override def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf())

    sc.parallelize(1 to 10)
      .map(getData)
      .saveAsTextFile("my_output")
  }

  def getData(i: Int) = {
    ( i, FAMILY, "data".toUpperCase )
  }
}

I submit it to a YARN cluster as follows:

HADOOP_CONF_DIR=/etc/hadoop/conf spark-submit \
    --conf spark.hadoop.validateOutputSpecs=false \
    --conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.2.1-hadoop2.4.0.jar \
    --deploy-mode=cluster \
    --master=yarn \
    --class=TestObject \
    target/scala-2.11/myjar-assembly-1.1.jar

Unexpectedly, the output looks like this, which suggests that the getData method cannot see the value of FAMILY:

(1,null,DATA)
(2,null,DATA)
(3,null,DATA)
(4,null,DATA)
(5,null,DATA)
(6,null,DATA)
(7,null,DATA)
(8,null,DATA)
(9,null,DATA)
(10,null,DATA)

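What do I need to understand about fields, scoping, and visibility, about spark-submit, and about objects, singletons, and the like, to see why this happens? And if I basically want a variable defined as a "constant" that is visible to the getData method, what should I be doing instead?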

Ami*_*ico 5

I might be missing something, but I don't think you should be defining a main method. When you extend App, you inherit a main, and you should not override it since that is what actually invokes the code in your App.
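To see why overriding main matters, here is a minimal Spark-free sketch. It assumes Scala 2 (the question builds against 2.11), where App is implemented with DelayedInit: the object body, including the initializer for FAMILY, is stored and only run by the inherited main. The name BrokenApp is made up for this illustration.

```scala
// Assumes Scala 2.x, where scala.App relies on DelayedInit.
// BrokenApp is a hypothetical name used only for this sketch.
object BrokenApp extends App {
  // This initializer is deferred: App's own main is what runs it.
  val FAMILY = "data".toUpperCase

  // Overriding main replaces the inherited one, so the deferred
  // body above is never executed and FAMILY keeps its default, null.
  override def main(args: Array[String]): Unit = {
    println(getData(1)) // prints (1,null,DATA)
  }

  def getData(i: Int): (Int, String, String) =
    (i, FAMILY, "data".toUpperCase)
}
```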

For example, the simple class in your answer should be written

object TestObject extends App {
  val FAMILY = "data"
  println(FAMILY, "data")
}
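An alternative, sketched here rather than taken from the answer above, is to drop App altogether and give the object an ordinary main. The Spark quick start guide itself advises defining a main() method instead of extending scala.App. With a plain object, FAMILY is assigned when the object is first loaded, which should also hold in the executor JVMs, where main is never called. The explicit return type and the sc.stop() call are additions for this sketch; the rest mirrors the code in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Plain object with an explicit main: no App, no deferred initialization.
object TestObject {
  // Assigned when the object is first initialized in any JVM that touches it.
  val FAMILY = "data".toUpperCase

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())

    sc.parallelize(1 to 10)
      .map(getData)
      .saveAsTextFile("my_output")

    // Added for this sketch: shut the context down cleanly.
    sc.stop()
  }

  def getData(i: Int): (Int, String, String) =
    (i, FAMILY, "data".toUpperCase)
}
```

Submitted with the same spark-submit command as in the question, this version would be expected to write (1,DATA,DATA) through (10,DATA,DATA).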