I run this spark-submit command on the Hortonworks VM and the Spark Scala program completes successfully. But once the job finishes, spark-submit does not exit until I press Ctrl+C. Why?
spark-submit --class SimpleApp --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 12m --executor-cores 1 target/scala-2.10/application_2.10-1.0.jar /user/root/decks/largedeck.txt
This is the code I am running:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val cards = sc.textFile(args(0)).flatMap(_.split(" "))
    val cardCount = cards.count()
    println(cardCount)
  }
}
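One common reason spark-submit lingers after the job completes is that the application never calls sc.stop(), so the driver's SparkContext and its non-daemon threads stay alive. This is an assumption about the cause, not something confirmed in the question; a minimal sketch of the same main with an explicit shutdown would look like:

```scala
/* SimpleApp.scala — same program, with an explicit SparkContext shutdown */
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    try {
      val cards = sc.textFile(args(0)).flatMap(_.split(" "))
      println(cards.count())
    } finally {
      sc.stop() // release the driver's resources so spark-submit can exit
    }
  }
}
```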
I am able to run this script to save the file in text format, but when I try saveAsSequenceFile it errors out. If anyone knows how to save an RDD as a sequence file, please walk me through the process. I have looked for a solution in "Learning Spark" as well as the official Spark documentation.
This runs successfully:
dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD.saveAsTextFile("/user/cloudera/pyspark/departments")
This fails:
dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD.saveAsSequenceFile("/user/cloudera/pyspark/departmentsSeq")
Error: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsSequenceFile. : org.apache.spark.SparkException: RDD element of type java.lang.String cannot be used
This is the data:
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
8,TESTING
8000,TESTING
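The error says the RDD holds plain strings, while a SequenceFile stores key-value pairs, so the lines would need to be mapped into tuples before saveAsSequenceFile. A minimal sketch, assuming each line has the `id,name` shape shown above (the `to_pair` helper name is hypothetical, and the PySpark calls are left as comments because they need a live SparkContext):

```python
def to_pair(line):
    """Split an 'id,name' CSV line into a (key, value) tuple,
    the element shape that saveAsSequenceFile expects."""
    key, value = line.split(",", 1)
    return (key, value)

# With a live SparkContext `sc`, the failing snippet would become:
# dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
# pairRDD = dataRDD.map(to_pair)          # RDD of (key, value) tuples
# pairRDD.saveAsSequenceFile("/user/cloudera/pyspark/departmentsSeq")

print(to_pair("2,Fitness"))  # ('2', 'Fitness')
```

The second argument to split limits it to one split, so a value containing a comma would stay intact.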