Connecting Apache Toree to a remote Spark cluster

yun*_*ang 8 apache-spark apache-toree

Is there a way to connect Apache Toree to a remote Spark cluster? I see that the usual command is

jupyter toree install --spark_home=/usr/local/bin/apache-spark/

How can I use Spark on a remote server without having to install it locally?

Jam*_*Con 6

There is indeed a way to get Toree to connect to a remote Spark cluster.

The easiest way I have found is to clone the existing Toree Scala/Python kernel and create a new Toree Scala/Python remote kernel. That way you can choose to run locally or remotely.

Steps:

  1. Copy the existing kernel. In my particular Toree installation, the kernels live under /usr/local/share/jupyter/kernels/, so I ran the following command:
    cp -pr /usr/local/share/jupyter/kernels/apache_toree_scala/ /usr/local/share/jupyter/kernels/apache_toree_scala_remote/

  2. Edit the new kernel.json file in /usr/local/share/jupyter/kernels/apache_toree_scala_remote/ and add the required Spark options to the __TOREE_SPARK_OPTS__ variable. Technically, only --master <path> is required, but you can also add --num-executors, --executor-memory, etc. to the variable.

  3. Restart Jupyter.
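The clone-and-edit steps above can also be sketched as a small script. This is a minimal sketch, not part of the original answer: the kernel paths and the master URL are assumptions you should adapt to your own install.

```python
import json
import shutil
from pathlib import Path

# Assumed kernel locations from the steps above; adjust for your install.
KERNELS_DIR = Path("/usr/local/share/jupyter/kernels")
SRC = KERNELS_DIR / "apache_toree_scala"
DST = KERNELS_DIR / "apache_toree_scala_remote"

def clone_toree_kernel(src: Path, dst: Path, master_url: str) -> None:
    """Clone an existing Toree kernel and point it at a remote Spark master."""
    # Step 1: copy the existing kernel directory.
    shutil.copytree(src, dst)
    # Step 2: edit the new kernel.json. Only --master is strictly required;
    # other spark-submit options (--num-executors, --executor-memory, ...)
    # can be appended to the same string.
    spec_path = dst / "kernel.json"
    spec = json.loads(spec_path.read_text())
    spec["display_name"] = "Toree - Scala Remote"
    spec["env"]["__TOREE_SPARK_OPTS__"] = f"--master {master_url} --deploy-mode client"
    spec_path.write_text(json.dumps(spec, indent=2))

# Example (hypothetical master address):
# clone_toree_kernel(SRC, DST, "spark://192.168.0.255:7077")
# Step 3: restart Jupyter so the new kernel is picked up.
```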

My kernel.json file looks like this:

{
  "display_name": "Toree - Scala Remote",
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_scala_remote/bin/run.sh",
    "--profile",
    "{connection_file}"
  ],
  "language": "scala",
  "env": {
    "PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.9-src.zip",
    "SPARK_HOME": "/opt/spark",
    "DEFAULT_INTERPRETER": "Scala",
    "PYTHON_EXEC": "python",
    "__TOREE_OPTS__": "",
    "__TOREE_SPARK_OPTS__": "--master spark://192.168.0.255:7077 --deploy-mode client --num-executors 4 --executor-memory 4g --executor-cores 8 --packages com.databricks:spark-csv_2.10:1.4.0"
  }
}
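Before restarting Jupyter, it can help to sanity-check the edited spec. The sketch below (my addition, not from the original answer) parses a kernel.json and verifies that the one required remote option, --master, is present and that SPARK_HOME is set so run.sh can find spark-submit; the path is an assumption.

```python
import json
from pathlib import Path

# Assumed path to the cloned remote kernel spec.
SPEC = Path("/usr/local/share/jupyter/kernels/apache_toree_scala_remote/kernel.json")

def check_spec(spec: dict) -> None:
    """Raise ValueError if the kernel spec is missing a required setting."""
    env = spec.get("env", {})
    if "--master" not in env.get("__TOREE_SPARK_OPTS__", ""):
        raise ValueError("__TOREE_SPARK_OPTS__ must include --master <url>")
    if not env.get("SPARK_HOME"):
        raise ValueError("SPARK_HOME must be set so run.sh can locate spark-submit")

# Usage:
# check_spec(json.loads(SPEC.read_text()))
```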