I'm trying to install the Apache Toree kernel for Spark compatibility and I'm running into a strange environment error. This is the process I followed:
I'm only really interested in the Scala kernel, but I installed all the interpreters. The OS is Windows 7, and using a virtual machine or Linux is not an option.
Here is the kernel.json file, which I modified so that the run.sh bash script is executed through Cygwin:
{
"language": "scala",
"display_name": "Apache Toree - Scala",
"env": {
"__TOREE_SPARK_OPTS__": "",
"SPARK_HOME": "C:\\CDH\\spark",
"__TOREE_OPTS__": "",
"DEFAULT_INTERPRETER": "Scala",
"PYTHONPATH": "C:\\CDH\\spark\\python:C:\\CDH\\spark\\python\\lib\\py4j-0.8.2.1-src.zip",
"PYTHON_EXEC": "python"
},
"argv": [
"C:\\cygwin64\\bin\\mintty.exe","-h","always","/bin/bash","-l","-e","C:\\ProgramData\\jupyter\\kernels\\apache_toree_scala\\bin\\run.sh",
"--profile",
"{connection_file}"
]
}
When running Jupyter, the kernel dies with this error:
TypeError: environment can only contain strings
Extended log:
[E 10:45:56.736 NotebookApp] Failed to run command:
['C:\\cygwin64\\bin\\mintty.exe', '-h', 'always', '/bin/bash', '-l', '-e', 'C:\\ProgramData\\jupyter\\kernels\\apache_toree_scala\\bin\\run.sh', '
--profile', 'C:\\Users\\luis\\AppData\\Roaming\\jupyter\\runtime\\kernel-e02cac9b-15de-4c69-a8e5-e5b11919e1bc.json']
with kwargs:
{'stdin': -1, 'stdout': None, …
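For what it is worth, this TypeError usually comes from Python 2's subprocess module on Windows, which refuses unicode strings in the child process environment; the "env" block of kernel.json is parsed from JSON (so its values come back as unicode) and is merged into the environment used to launch the kernel. A minimal sketch of the failure, assuming Python 2 on Windows (the path is just the one from the kernel.json above):

import subprocess

# Under Python 2 on Windows, every key and value passed via env must be a
# byte string (str); unicode values trigger
# "TypeError: environment can only contain strings".
env = {u"SPARK_HOME": u"C:\\CDH\\spark"}  # unicode, as JSON parsing produces
subprocess.Popen(["cmd", "/c", "echo hello"], env=env)

A workaround that is often suggested is to re-encode the environment before launching, e.g. env = {str(k): str(v) for k, v in env.items()}, or to run the notebook server under a Python where this restriction does not apply.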
Is there a way to connect Apache Toree to a remote Spark cluster? I see the common command is
jupyter toree install --spark_home=/usr/local/bin/apache-spark/
How can I use Spark on a remote server without having to install it locally?
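As far as I understand, Toree just forwards whatever is in --spark_opts (or the __TOREE_SPARK_OPTS__ entry of the generated kernel.json) to spark-submit, so the usual approach is to keep a local Spark distribution for the driver side and point --master at the remote cluster, e.g. --spark_opts='--master spark://remote-host:7077' (the host name here is made up). In plain PySpark terms the intent is roughly this sketch:

from pyspark import SparkConf, SparkContext

# Illustrative only: the remote master URL is hypothetical; the driver still
# runs locally and submits work to the remote cluster.
conf = SparkConf().setMaster("spark://remote-host:7077").setAppName("toree-remote-check")
sc = SparkContext(conf=conf)
print(sc.master)  # should echo the remote master if the option was picked up
sc.stop()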
Google does turn up plenty of fixes for this problem, but unfortunately, even after trying every one of them, I could not get it to work, so please bear with me and see if anything jumps out at you.
OS: Mac
Spark: 1.6.3 (2.10)
Jupyter Notebook: 4.4.0
Python: 2.7
Scala: 2.12.1
I was able to install and run Jupyter Notebook successfully. Next, I tried to configure it to work with Spark, for which I installed the Spark interpreters using Apache Toree. Now, when I try to run any RDD operation in the notebook, the following error is thrown:
Error from python worker:
/usr/bin/python: No module named pyspark
PYTHONPATH was:
/private/tmp/hadoop-xxxx/nm-local-dir/usercache/xxxx/filecache/33/spark-assembly-1.6.3-hadoop2.2.0.jar
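A note before the list of things already tried: the PYTHONPATH in the error above is the one seen by the executor's Python worker (a YARN usercache directory), not the driver's, so exporting PYTHONPATH in .bash_profile only affects the driver. A hedged sketch of one way to ship the PySpark archives to the executors explicitly, with paths mirroring the kernel.json below (whether this applies depends on how the cluster is actually launched):

from pyspark import SparkContext

# Minimal check, assuming the driver itself can already import pyspark (as
# stated below). addPyFile ships the archives to the executors so their
# Python workers can import pyspark as well. "local[2]" just keeps the
# sketch self-contained; on a real cluster the master comes from the
# kernel's submit options.
sc = SparkContext("local[2]", "worker-pythonpath-check")
sc.addPyFile("/Users/xxxx/Desktop/utils/spark/python/lib/pyspark.zip")
sc.addPyFile("/Users/xxxx/Desktop/utils/spark/python/lib/py4j-0.9-src.zip")
print(sc.parallelize(range(4)).map(lambda x: x * x).collect())
sc.stop()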
Things I have already tried:
1. Setting PYTHONPATH in .bash_profile.
2. Confirmed that I can import 'pyspark' in a local Python CLI.
3. Tried updating the interpreter's kernel.json to the following:
{
"language": "python",
"display_name": "Apache Toree - PySpark",
"env": {
"__TOREE_SPARK_OPTS__": "",
"SPARK_HOME": "/Users/xxxx/Desktop/utils/spark",
"__TOREE_OPTS__": "",
"DEFAULT_INTERPRETER": "PySpark",
"PYTHONPATH": "/Users/xxxx/Desktop/utils/spark/python:/Users/xxxx/Desktop/utils/spark/python/lib/py4j-0.9-src.zip:/Users/xxxx/Desktop/utils/spark/python/lib/pyspark.zip:/Users/xxxx/Desktop/utils/spark/bin",
"PYSPARK_SUBMIT_ARGS": "--master local --conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
"PYTHON_EXEC": "python"
},
"argv": [
"/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
"--profile",
"{connection_file}"
] …
I get a kernel error when I create a Jupyter notebook with the Apache Toree - Scala kernel. Here is the stack trace:
Traceback (most recent call last):
File "C:\Users\darie\Anaconda3\lib\site-packages\notebook\base\handlers.py", line 516, in wrapper
result = yield gen.maybe_future(method(self, *args, **kwargs))
File "C:\Users\darie\Anaconda3\lib\site-packages\tornado\gen.py", line 1055, in run
value = future.result()
File "C:\Users\darie\Anaconda3\lib\site-packages\tornado\concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "C:\Users\darie\Anaconda3\lib\site-packages\tornado\gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "C:\Users\darie\Anaconda3\lib\site-packages\notebook\services\sessions\handlers.py", line 75, in post
type=mtype))
File "C:\Users\darie\Anaconda3\lib\site-packages\tornado\gen.py", line 1055, in run
value = future.result()
File "C:\Users\darie\Anaconda3\lib\site-packages\tornado\concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in …Run Code Online (Sandbox Code Playgroud) "Apache Toree - Scala"的语法突出显示无法正常工作.当我在单元格中编写一些代码时,Jupyter不会突出显示它.
I have already configured the kernel.json file, but it did not help. Does anyone know a way to fix this?
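For background (hedged, since I have not verified this against this exact Toree build): the notebook chooses its CodeMirror highlighting mode from the language_info / codemirror_mode that the running kernel reports in its kernel_info reply, not from the "language" field of kernel.json alone, so editing kernel.json by itself often has no visible effect. A small sketch to see what the installed spec actually declares (the kernel name is an assumption and should match `jupyter kernelspec list`):

from jupyter_client.kernelspec import KernelSpecManager

# Inspect the installed Toree Scala kernelspec.
spec = KernelSpecManager().get_kernel_spec("apache_toree_scala")
print(spec.language)      # what kernel.json declares ("scala")
print(spec.resource_dir)  # where kernel.js / logo assets for this spec live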
I am having trouble running Scala Spark on Jupyter. Below is the error message I get when I load an Apache Toree - Scala notebook in Jupyter.
root@ubuntu-2gb-sgp1-01:~# jupyter notebook --ip 0.0.0.0 --port 8888
[I 03:14:54.281 NotebookApp] Serving notebooks from local directory: /root
[I 03:14:54.281 NotebookApp] 0 active kernels
[I 03:14:54.281 NotebookApp] The Jupyter Notebook is running at: http://0.0.0.0:8888/
[I 03:14:54.281 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 03:14:54.282 NotebookApp] No web browser found: could not locate runnable browser.
[I 03:15:09.976 NotebookApp] 302 GET / (61.6.68.44) 1.21ms
[I 03:15:15.924 NotebookApp] Creating new …
I am trying to install Jupyter with Spark support in a conda environment (which I set up following http://conda.pydata.org/docs/test-drive.html) of the Anaconda distribution. I am trying to use Apache Toree as the Jupyter kernel.
This is what I did after installing Anaconda:
conda create --name jupyter python=3
source activate jupyter
conda install jupyter
pip install --pre toree
jupyter toree install
Everything works fine until I reach the last line, where I get
PermissionError: [Errno 13] Permission denied: '/usr/local/share/jupyter'
Which raises the question: why is it even looking at that directory? After all, everything should stay inside the environment. So I ran
jupyter --paths
and got
config:
/home/user/.jupyter
~/anaconda2/envs/jupyter/etc/jupyter
/usr/local/etc/jupyter
/etc/jupyter
data:
/home/user/.local/share/jupyter
~/anaconda2/envs/jupyter/share/jupyter
/usr/local/share/jupyter
/usr/share/jupyter
runtime:
/run/user/1000/jupyter
I am not quite sure what is going on here, or how to keep everything running (if possible) inside the conda environment "jupyter".
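For what it is worth, the "data:" list above is simply Jupyter's kernelspec search path: the per-user directory, the active environment (sys.prefix), and then system-wide locations such as /usr/local/share/jupyter, which is why an installer that targets a system-wide location needs root. A quick sketch to reproduce that list from Python inside the activated "jupyter" environment:

from jupyter_core.paths import jupyter_data_dir, jupyter_path

# jupyter_path() mirrors the "data:" section of `jupyter --paths`;
# jupyter_data_dir() is the per-user location that never needs root.
print(jupyter_data_dir())
print(jupyter_path("kernels"))

Passing --user to jupyter toree install (as done in the last question on this page) targets that per-user directory and should avoid the permission error, although the kernel is then not tied to the conda environment itself.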
I am using PySpark in a Jupyter Notebook by installing the Apache Toree kernel, with Anaconda v4.0.0 (Python 2.7.11). After fetching a table from Hive, I want to plot some charts in the Jupyter notebook with matplotlib/pandas, following this tutorial:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)
normals = pd.Series(np.random.normal(size=10))
normals.plot()
I am stuck right at the first step, when I try to use the %matplotlib inline display:
Name: Error parsing magics!
Message: Magics [matplotlib] do not exist!
StackTrace:
Looking at Toree's Magic and MagicManager, I realized that %matplotlib is being dispatched to Toree's MagicManager rather than to the IPython built-in magic command.
Is it possible for Apache Toree - PySpark to use the IPython built-in …
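Since Toree's PySpark interpreter dispatches magics to its own MagicManager rather than to IPython, IPython line magics such as %matplotlib are simply not registered, which is what the "Magics [matplotlib] do not exist!" message is saying. A hedged sketch of one workaround that does not rely on the magic at all: pick a non-interactive backend and write the figure to a file instead of displaying it inline.

import matplotlib
matplotlib.use("Agg")  # select a non-interactive backend; no %matplotlib needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

normals = pd.Series(np.random.normal(size=10))
ax = normals.plot()
ax.get_figure().savefig("normals.png")  # inspect the chart as a file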
I am running RHEL 6.7 and have Anaconda installed (Anaconda 4.10). Jupyter works out of the box, and by default it has the Python kernel. Everything is dandy, so I can select "python notebook" in Jupyter.
I would now like to get Scala set up with Jupyter as well. (It looks like the Spark kernel - now Toree - is the way to do that?)
None of the questions/answers I have seen address the problem I am running into.
I tried to install Toree, and ran
sudo pip install toree
and it worked. But then the next step,
jupyter toree install
fails with the following error:
jupyter toree install
Traceback (most recent call last):
File "/usr/app/anaconda/bin/jupyter-toree", line 7, in <module>
from toree.toreeapp import main
ImportError: No module named toree.toreeapp
Am I missing a step? Am I doing something wrong? I am happy to provide more information if needed. Thanks!
Edit: What is the standard/easiest/most reliable way to get a Scala notebook working in Jupyter? (TL;DR)
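This ImportError usually means that the jupyter-toree entry point under /usr/app/anaconda/bin is running against a Python whose site-packages does not actually contain the toree package - a classic symptom of sudo pip install landing in the system Python rather than Anaconda's. A hedged sketch for checking which interpreter is in play and whether it can see the package:

import sys

print(sys.executable)  # which Python is actually running this

try:
    import toree
    print(toree.__file__)  # where the package was installed, if importable
except ImportError:
    print("toree is not importable from this interpreter")

Running this with the same Python that /usr/app/anaconda/bin/jupyter-toree uses (and, separately, with /usr/bin/python) should show where the sudo pip install actually put the package; reinstalling with Anaconda's own pip and the --pre flag, as in the conda question above, may be worth trying.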
I usually start my Spark shell with:
./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.2.0,graphframes:graphframes:0.1.0-spark1.6,com.databricks:spark-avro_2.10:2.0.1
I am now trying out Apache Toree; how should I load those libraries in the notebook?
I tried the following:
jupyter toree install --user --spark_home=/home/eron/spark-1.6.1/ --spark_opts="--packages com.databricks:spark-csv_2.10:1.2.0,graphframes:graphframes:0.1.0-spark1.6,com.databricks:spark-avro_2.10:2.0.1"
but that does not seem to work.
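One pattern that is sometimes suggested when the --spark_opts flag at install time does not take effect: edit the __TOREE_SPARK_OPTS__ entry in the kernel.json that the install generated, since that is, as far as I can tell, the value the kernel's run.sh hands to spark-submit. A hedged sketch (the kernelspec path is an assumption for a --user install; check it with jupyter kernelspec list):

import json

# Hypothetical location of the Toree Scala kernelspec installed with --user.
spec = "/home/eron/.local/share/jupyter/kernels/apache_toree_scala/kernel.json"
packages = ("com.databricks:spark-csv_2.10:1.2.0,"
            "graphframes:graphframes:0.1.0-spark1.6,"
            "com.databricks:spark-avro_2.10:2.0.1")

with open(spec) as f:
    cfg = json.load(f)
cfg["env"]["__TOREE_SPARK_OPTS__"] = "--packages " + packages
with open(spec, "w") as f:
    json.dump(cfg, f, indent=2)

Toree also ships an AddDeps magic for resolving dependencies from inside a cell (something like %AddDeps com.databricks spark-csv_2.10 1.2.0 --transitive), which may be an alternative worth trying.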
apache-toree ×10
apache-spark ×7
scala ×4
jupyter ×3
ipython ×2
pyspark ×2
python ×2
anaconda ×1
conda ×1
matplotlib ×1
windows ×1