PySpark DataFrames as views

Question

Jos*_*osh 3 sql view apache-spark-sql pyspark

For a script I am running, I have a series of chained views that each look at a specific set of data in SQL (I am using Apache Spark SQL):

%sql
create view view_1 as
select column_1,column_2 from original_data_table

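This logic eventually culminates in view_n. However, I then need to perform logic that is difficult (or impossible) to implement in SQL, specifically the explode command: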

%python
from pyspark.sql.functions import explode, split

df_1 = sqlContext.sql("SELECT * from view_n")
df1_exploded = df_1.withColumn("exploded_column", explode(split(df_1.col_to_explode, ',')))

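My questions:

Is there a speed cost to switching between SQL tables and PySpark DataFrames? Or, since PySpark DataFrames are lazily evaluated, is it very similar to a view?
Is there a better way to switch from a SQL table to a PySpark DataFrame?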

Answer 1

the*_*hon 6

You can explode() anything a DataFrame has through Spark SQL (https://spark.apache.org/docs/latest/api/sql/index.html):

print(spark.version)
2.4.3

df = spark.createDataFrame([(1, [1,2,3]), (2, [4,5,6]), (3, [7,8,9]),],["id", "nest"])
df.printSchema()

root
 |-- id: long (nullable = true)
 |-- nest: array (nullable = true)
 |    |-- element: long (containsNull = true)

# Register the DataFrame as a temporary view, then explode the array column in SQL
df.createOrReplaceTempView("sql_view")
spark.sql("SELECT id, explode(nest) as flatten FROM sql_view").show()

+---+-------+
| id|flatten|
+---+-------+
|  1|      1|
|  1|      2|
|  1|      3|
|  2|      4|
|  2|      5|
|  2|      6|
|  3|      7|
|  3|      8|
|  3|      9|
+---+-------+

Archived:	6 years, 6 months ago
Viewed:	18736 times
Last active:	6 years, 6 months ago