Connecting to a remote Spark master - Java/Scala

cyb*_*ron 7 java hadoop scala amazon-ec2 apache-spark

I created a 3-node (1 master, 2 workers) Apache Spark cluster in AWS. I can submit jobs to the cluster from the master itself, but I cannot get it to work remotely.

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/usr/local/spark/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://ec2-54-245-111-320.compute-1.amazonaws.com:7077")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    sc.stop()
  }
}

From the master's web UI I can see:

Spark Master at spark://ip-171-13-22-125.ec2.internal:7077
URL: spark://ip-171-13-22-125.ec2.internal:7077
REST URL: spark://ip-171-13-22-125.ec2.internal:6066 (cluster mode)

So when I execute SimpleApp.scala from my local machine, it fails to connect to the Spark master:

2017-02-04 19:59:44,074 INFO  [appclient-register-master-threadpool-0] client.StandaloneAppClient$ClientEndpoint (Logging.scala:54)  [] - Connecting to master spark://ec2-54-245-111-320.compute-1.amazonaws.com:7077...
2017-02-04 19:59:44,166 WARN  [appclient-register-master-threadpool-0] client.StandaloneAppClient$ClientEndpoint (Logging.scala:87)  [] - Failed to connect to spark://ec2-54-245-111-320.compute-1.amazonaws.com:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) ~[spark-core_2.10-2.0.2.jar:2.0.2]
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) ~[spark-core_2.10-2.0.2.jar:2.0.2]
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) ~[scala-library-2.10.0.jar:?]
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) ~[spark-core_2.10-2.0.2.jar:2.0.2]

However, I know that if I set the master to local it works, because then everything runs locally. What I want is for my client to connect to this remote master. How can I accomplish that? The Spark configuration looks fine. I can even telnet to the public DNS and port, and I have configured /etc/hosts with the public DNS and hostname of each EC2 instance. I want to be able to submit jobs to this remote master. What am I missing?

aba*_*hel 8

To bind the master to a hostname/IP, go to the conf directory of your Spark installation (spark-2.0.2-bin-hadoop2.7/conf) and create the spark-env.sh file with the command below.

cp spark-env.sh.template spark-env.sh

Open spark-env.sh in an editor such as vi and add the line below with the hostname/IP of the master.

SPARK_MASTER_HOST=ec2-54-245-111-320.compute-1.amazonaws.com

Stop and restart Spark using stop-all.sh and start-all.sh (both in the sbin directory). You can then connect to the remote master with:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSample")
  .master("spark://ec2-54-245-111-320.compute-1.amazonaws.com:7077")
  .getOrCreate()
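If the master now binds correctly but the connection still fails, a common follow-up issue when the driver runs outside the cluster is that the executors cannot reach back to the driver. Below is a minimal sketch, assuming your local machine has an address the EC2 nodes can route to (the hostname in spark.driver.host is a hypothetical placeholder, not from the original question):

import org.apache.spark.sql.SparkSession

object RemoteConnectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSample")
      .master("spark://ec2-54-245-111-320.compute-1.amazonaws.com:7077")
      // Hypothetical address: executors open connections back to the driver,
      // so this must be reachable from the EC2 nodes (and the driver ports
      // must be open in the security group).
      .config("spark.driver.host", "my-local-machine.example.com")
      .getOrCreate()

    spark.stop()
  }
}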

For more information on setting environment variables, see http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts

  • Can someone explain why this answer is supposed to work, and on which machine I am meant to cast these incantations? This is super unhelpful. (2 upvotes)