Aka*_*all 4 apache-spark pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("myApp").setMaster("local")
sc = SparkContext(conf=conf)
a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(['ind', "state"])
a.show()
Run Code Online (Sandbox Code Playgroud)
结果是:
Traceback (most recent call last):
File "/Users/ktemlyakov/messing_around/SparkStuff/mock_maersk_data.py", line 7, in <module>
a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(['ind', "state"])
AttributeError: 'RDD' object has no attribute 'toDF'
Run Code Online (Sandbox Code Playgroud)
我错过了什么?
sqlContext不见了; 它需要被创建。以下代码有效:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark import sql
conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)
a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(['ind', "state"])
a.show()
Run Code Online (Sandbox Code Playgroud)
编辑:
在 Spark 2.0 中,以上可以通过以下方式实现:
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").config(conf=SparkConf()).getOrCreate()
a = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]], ['ind', "state"])
a.show()
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
15653 次 |
| 最近记录: |