Can I run a PySpark Jupyter notebook in cluster deploy mode?

Question


J S*_*idt 6 apache-spark pyspark jupyter-notebook

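Context: the cluster is set up as follows:

Everything runs with Docker files.
node1: Spark master
node2: JupyterHub (where I also run my notebooks)
node3-7: Spark worker nodes
I can telnet and ping from my worker nodes to node2 and vice versa on Spark's default ports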

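Problem: I am trying to create a Spark session, running in cluster deploy mode, from a PySpark Jupyter notebook. I want the driver to run on a node other than the one running the Jupyter notebook. Right now I can run jobs on the cluster, but only with the driver running on node2.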

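After a lot of digging I found this stackoverflow post, which claims that if you run an interactive shell with Spark, you can only do so in local deploy mode (client mode, where the driver is located on the machine you are working on). The post goes on to say that something like JupyterHub will consequently not work in cluster deploy mode either, but I cannot find any documentation that confirms this. Can someone confirm whether JupyterHub can run in cluster mode at all?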

My attempt to run the Spark session in cluster deploy mode:

from pyspark.sql import SparkSession

# "<node 3 ip>" is a placeholder for the address of node3.
spark = (
    SparkSession.builder
    .enableHiveSupport()
    .config("spark.local.ip", "<node 3 ip>")
    .config("spark.driver.host", "<node 3 ip>")
    .config("spark.submit.deployMode", "cluster")
    .getOrCreate()
)


Error:

/usr/spark/python/pyspark/sql/session.py in getOrCreate(self)
    167                     for key, value in self._options.items():
    168                         sparkConf.set(key, value)
--> 169                     sc = SparkContext.getOrCreate(sparkConf)
    170                     # This SparkContext may be an existing one.
    171                     for key, value in self._options.items():

/usr/spark/python/pyspark/context.py in getOrCreate(cls, conf)
    308         with SparkContext._lock:
    309             if SparkContext._active_spark_context is None:
--> 310                 SparkContext(conf=conf or SparkConf())
    311             return SparkContext._active_spark_context
    312 

/usr/spark/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    113         """
    114         self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    116         try:
    117             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/spark/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    257         with SparkContext._lock:
    258             if not SparkContext._gateway:
--> 259                 SparkContext._gateway = gateway or launch_gateway(conf)
    260                 SparkContext._jvm = SparkContext._gateway.jvm
    261 

/usr/spark/python/pyspark/java_gateway.py in launch_gateway(conf)
     93                 callback_socket.close()
     94         if gateway_port is None:
---> 95             raise Exception("Java gateway process exited before sending the driver its port number")
     96 
     97         # In Windows, ensure the Java child processes do not linger after Python has exited.

Exception: Java gateway process exited before sending the driver its port number


Answer 1

hi-*_*zir 5

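You cannot use cluster mode with PySpark at all:

Currently, the standalone mode does not support cluster mode for Python applications.

And even if you could, cluster mode is not applicable to interactive environments: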

case (_, CLUSTER) if isShell(args.primaryResource) =>
  error("Cluster deploy mode is not applicable to Spark shells.")
case (_, CLUSTER) if isSqlShell(args.mainClass) =>
  error("Cluster deploy mode is not applicable to Spark SQL shell.")

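To make the two restrictions above easier to follow, here is a rough Python paraphrase of the checks. It is only a sketch, not Spark's actual code: `check_deploy_mode` and the way it detects a standalone master or a Python application are assumptions made for illustration, and only the shell message is copied from the snippet above. A session created from a notebook goes through the `pyspark-shell` resource, so it would likely be caught by one of these checks. A plausible reading of the traceback in the question is that the launcher JVM exits on such a check, and the notebook only sees "Java gateway process exited before sending the driver its port number".

```python
# Illustrative paraphrase of the deploy-mode checks; NOT Spark's real code.
# check_deploy_mode and its detection rules are hypothetical.


def check_deploy_mode(master, deploy_mode, primary_resource):
    """Return an error message if the combination is rejected, else None."""
    if deploy_mode != "cluster":
        return None  # client mode is always acceptable here

    is_shell = primary_resource in ("spark-shell", "pyspark-shell")
    is_python = (primary_resource.endswith(".py")
                 or primary_resource == "pyspark-shell")
    is_standalone = master.startswith("spark://")

    # Restriction 1: Python applications on a standalone cluster.
    if is_standalone and is_python:
        return ("Cluster deploy mode is not supported for Python "
                "applications on standalone clusters.")  # paraphrased wording
    # Restriction 2: interactive shells, whatever the cluster manager.
    if is_shell:
        return "Cluster deploy mode is not applicable to Spark shells."
    return None


# A notebook session against a standalone master (host name is made up).
print(check_deploy_mode("spark://node1:7077", "cluster", "pyspark-shell"))
# The same interactive session against YARN is still rejected.
print(check_deploy_mode("yarn", "cluster", "pyspark-shell"))
# A non-interactive Python script on YARN passes both checks.
print(check_deploy_mode("yarn", "cluster", "job.py"))
```

Under this reading, the only way to get the driver off node2 is to submit a non-interactive application to a cluster manager that supports cluster mode for Python.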

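Just want to point out that this is true for standalone mode (as in the original question). If you use a different cluster manager (e.g. YARN), you can submit jobs in cluster mode with PySpark. (3 upvotes)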
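As a sketch of what that comment describes, the snippet below assembles a `spark-submit` command line for a Python script in cluster mode on YARN. The helper `build_submit_command`, the script name `job.py`, and the memory setting are hypothetical; `--master`, `--deploy-mode`, and `--conf` are standard `spark-submit` options. It only builds and prints the command, so it can be inspected without a cluster.

```python
import shlex


def build_submit_command(script, master="yarn", deploy_mode="cluster", conf=None):
    """Build the argument list for a non-interactive spark-submit call."""
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Each Spark property is passed as a separate --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(script)  # the application script comes last
    return args


# job.py and the memory value are placeholders for illustration.
cmd = build_submit_command("job.py", conf={"spark.executor.memory": "2g"})
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where `spark-submit` is installed. This runs the script as a batch job, so it does not give you an interactive notebook session with a remote driver.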
