Bal*_*ala 16 dataframe pandas apache-spark apache-zeppelin
我是齐柏林飞艇的新手.我有一个用例,其中我有一个pandas数据帧.我需要使用zeppelin的内置图表来可视化集合我在这里没有明确的方法.我的理解是使用zeppelin,如果它是RDD格式,我们可以将数据可视化.所以,我想将pandas dataframe转换为spark数据帧,然后进行一些查询(使用sql),我会想象.首先,我试图将pandas数据帧转换为spark,但我失败了
%pyspark
import pandas as pd
from pyspark.sql import SQLContext
print sc
df = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
print type(df)
print df
sqlCtx = SQLContext(sc)
sqlCtx.createDataFrame(df).show()
Run Code Online (Sandbox Code Playgroud)
我得到了以下错误
Traceback (most recent call last): File "/tmp/zeppelin_pyspark.py",
line 162, in <module> eval(compiledCode) File "<string>",
line 8, in <module> File "/home/bala/Software/spark-1.5.0-bin-hadoop2.6/python/pyspark/sql/context.py",
line 406, in createDataFrame rdd, schema = self._createFromLocal(data, schema) File "/home/bala/Software/spark-1.5.0-bin-hadoop2.6/python/pyspark/sql/context.py",
line 322, in _createFromLocal struct = self._inferSchemaFromList(data) File "/home/bala/Software/spark-1.5.0-bin-hadoop2.6/python/pyspark/sql/context.py",
line 211, in _inferSchemaFromList schema = _infer_schema(first) File "/home/bala/Software/spark-1.5.0-bin-hadoop2.6/python/pyspark/sql/types.py",
line 829, in _infer_schema raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'str'>
Run Code Online (Sandbox Code Playgroud)
有人可以帮帮我吗?如果我在任何地方都错了,请纠正我.
edd*_*ies 14
以下适用于Zeppelin 0.6.0,Spark 1.6.2和Python 3.5.2:
%pyspark
import pandas as pd
df = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
z.show(sqlContext.createDataFrame(df))
Run Code Online (Sandbox Code Playgroud)
其呈现为:

小智 7
我刚刚将你的代码复制并粘贴在笔记本中,它可以工作.
%pyspark
import pandas as pd
from pyspark.sql import SQLContext
print sc
df = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
print type(df)
print df
sqlCtx = SQLContext(sc)
sqlCtx.createDataFrame(df).show()
<pyspark.context.SparkContext object at 0x10b0a2b10>
<class 'pandas.core.frame.DataFrame'>
k v
0 foo 1
1 bar 2
+---+-+
| k|v|
+---+-+
|foo|1|
|bar|2|
+---+-+
Run Code Online (Sandbox Code Playgroud)
我正在使用这个版本:zeppelin-0.5.0-incubating-bin-spark-1.4.0_hadoop-2.3.tgz
| 归档时间: |
|
| 查看次数: |
33304 次 |
| 最近记录: |