I want to run a standalone Spark script that I have already compiled with the sbt package command. How do I set up the correct run configuration for the Scala script so that I can run it inside the IntelliJ IDE? Currently I run it from the command line with the command below, but I would like to run it from IntelliJ, for example to do further debugging:
~/spark-1.2.0/bin/spark-submit --class "CoinPipe" target/scala-2.10/coinpipe_2.10-1.0.jar /training/data/dir 7 12
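To run the same entry point from IntelliJ, one common arrangement is sketched below (an assumption about how CoinPipe could be structured, not the actual code; the argument names are made up): fall back to a local master when no --master has been supplied, then create an ordinary Application run configuration with main class CoinPipe and program arguments /training/data/dir 7 12.

import org.apache.spark.{SparkConf, SparkContext}

object CoinPipe {
  def main(args: Array[String]): Unit = {
    // spark-submit injects spark.master via --master; when launched from IntelliJ
    // nothing sets it, so fall back to a local master for debugging.
    val conf = new SparkConf().setAppName("CoinPipe")
    if (!conf.contains("spark.master")) conf.setMaster("local[*]")
    val sc = new SparkContext(conf)

    val Array(dataDir, param1, param2) = args // e.g. /training/data/dir 7 12
    // ... existing CoinPipe logic ...

    sc.stop()
  }
}

When running from the IDE, the Spark dependency also has to be on the module classpath (i.e. not marked as "provided" in build.sbt), since the classes that spark-submit normally supplies would otherwise be missing.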
Below is a snapshot of what I am trying to do:

I have a function flagVectorOutlier, shown in the code below. I am using Breeze DenseVector and DenseMatrix objects to compute the distance, and, as coded in the function signature, I want to get a Spark RDD[(Double, Boolean)] back. mi and invCovMatrix are a Breeze DenseVector[Double] and DenseMatrix[Double], respectively:
def flagVectorOutlier(testVectors: RDD[(String, SparkVector)], distanceThreshold: Double): RDD[(Double, Boolean)] = {
  val testVectorsDenseRDD = testVectors.map { vector => DenseVector(vector._2.toArray) }
  val mahalanobisDistancesRDD = testVectorsDenseRDD.map { vector =>
    val distance = DenseVector[Double](DenseVector(Transpose(vector - mi) * invCovMatrix) * DenseVector(vector - mi)).toArray
    (distance(0), if (distance(0) >= distanceThreshold) true else false)
  }
  mahalanobisDistancesRDD
}
The compiler ends up showing me the following two errors:

Error:(75, 93) could not find implicit value …
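For comparison, here is a minimal sketch (not the original code) of how this quadratic form is usually written with Breeze; mi, invCovMatrix and the SparkVector alias are assumed to be the same ones as above. On two DenseVectors, x.t * y already yields a plain Double, so no extra DenseVector wrapping is needed:

import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector => SparkVector} // assumed alias

def flagVectorOutlierSketch(testVectors: RDD[(String, SparkVector)],
                            mi: DenseVector[Double],
                            invCovMatrix: DenseMatrix[Double],
                            distanceThreshold: Double): RDD[(Double, Boolean)] =
  testVectors.map { case (_, v) =>
    val x = DenseVector(v.toArray)
    val centered = x - mi
    // (x - mi)^T * S^-1 * (x - mi): Transpose[DenseVector] * DenseVector is a Double
    val distance = centered.t * (invCovMatrix * centered)
    (distance, distance >= distanceThreshold)
  }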
I am running the snippet below to sort an RDD of points by distance and take the K nearest points to a given point:

def getKNN(sparkContext: SparkContext, k: Int, point2: Array[Double], pointsRDD: RDD[Array[Double]]): RDD[Array[Double]] = {
  val tuplePointDistanceRDD: RDD[(Double, Array[Double])] = pointsRDD.map(point =>
    (DistanceUtils.euclidianDistance(point, point2), point))
  sparkContext.parallelize(tuplePointDistanceRDD.sortBy(_._1).map(_._2).take(k))
}
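For reference, the same top-k selection can also be written with takeOrdered, which avoids sorting the whole RDD (a sketch rather than the original code; DistanceUtils.euclidianDistance is the same helper as above, and this variant is independent of the broadcast error described below):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def getKNNSketch(sparkContext: SparkContext, k: Int, point2: Array[Double],
                 pointsRDD: RDD[Array[Double]]): RDD[Array[Double]] = {
  val nearest = pointsRDD
    .map(point => (DistanceUtils.euclidianDistance(point, point2), point))
    .takeOrdered(k)(Ordering.by(_._1)) // only the k closest tuples reach the driver
    .map(_._2)
  sparkContext.parallelize(nearest)
}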
Using just one SparkContext in my application and passing it as a parameter to my functions, I get the error org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0 at the moment I retrieve the KNN points of point2, i.e. at sparkContext.parallelize(tuplePointDistanceRDD.sortBy(_._1).map(_._2).take(k)).
I am constructing the sparkContext with this snippet:
var sparkContext = new SparkContext("local", "<app_name>")
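For reference, the same local context expressed through SparkConf (a sketch, unrelated to the error itself):

import org.apache.spark.{SparkConf, SparkContext}

// The same local context built from a SparkConf; "<app_name>" is the
// placeholder application name used above.
val conf = new SparkConf().setMaster("local").setAppName("<app_name>")
val sparkContext = new SparkContext(conf)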
What could be the possible causes of this kind of error?
Basically this is the log of my standalone Spark environment, with the stack trace of this error:
15/12/24 11:55:29 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@localhost:55731]
15/12/24 11:55:29 INFO Utils: Successfully started service 'sparkDriver' on port 55731.
15/12/24 11:55:29 INFO SparkEnv: Registering MapOutputTracker
15/12/24 11:55:29 INFO SparkEnv: Registering BlockManagerMaster
15/12/24 11:55:29 INFO DiskBlockManager: Created …

I am running a .jar file that contains all the dependencies I need packaged inside it. One of those dependencies, com.google.common.util.concurrent.RateLimiter, has already been checked, and its class file is inside this .jar file.
Unfortunately, when I launch the spark-submit command on the master node of my Google Dataproc cluster instance, I get this error:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
at com.google.common.util.concurrent.RateLimiter$SleepingStopwatch$1.<init>(RateLimiter.java:417)
at com.google.common.util.concurrent.RateLimiter$SleepingStopwatch.createFromSystemTimer(RateLimiter.java:416)
at com.google.common.util.concurrent.RateLimiter.create(RateLimiter.java:130)
at LabeledAddressDatasetBuilder.publishLabeledAddressesFromBlockstem(LabeledAddressDatasetBuilder.java:60)
at LabeledAddressDatasetBuilder.main(LabeledAddressDatasetBuilder.java:144)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Something seems to be overriding my dependencies. I have already decompiled the Stopwatch.class file from this .jar and checked that the method exists. This only happens when I run on the Google Dataproc instance; I ran grep on the process while spark-submit was executing and the -cp flag looked like this:
/usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.5.0-hadoop2.7.1.jar:/usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/lib/spark/lib/datanucleus-core-3.2.10.jar:/etc/hadoop/conf/:/etc/hadoop/conf/:/usr/lib/hadoop/lib/native/:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*
Is there any way to solve this problem?
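One common way of dealing with this kind of shadowing, sketched here under the assumption that the fat jar is built with sbt-assembly, is to shade Guava inside the assembly so the application always resolves its own copy instead of the older Guava bundled with the Hadoop and Spark jars on the cluster classpath:

// build.sbt (sketch; requires the sbt-assembly plugin in project/plugins.sbt)
// Relocate Guava classes inside the assembled jar so the older Guava that ships
// with the Hadoop/Spark jars on the -cp above can no longer shadow them.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)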
Thanks.
I want to execute a Python notebook I created for data preprocessing from inside another notebook that handles a data classification process, so this last notebook depends on the functions and the execution of the first one.

How can I do that in the Google Cloud Datalab environment? I want to reuse the functions and variables from the preprocessing notebook in the classification notebook.
Thanks.