相关疑难解决方法(0)

避免Spark窗口函数中单个分区模式的性能影响

我的问题是由计算spark数据帧中连续行之间差异的用例触发的.

例如,我有:

>>> df.show()
+-----+----------+
|index|      col1|
+-----+----------+
|  0.0|0.58734024|
|  1.0|0.67304325|
|  2.0|0.85154736|
|  3.0| 0.5449719|
+-----+----------+
Run Code Online (Sandbox Code Playgroud)

如果我选择使用"Window"函数计算它们,那么我可以这样做:

>>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
>>> import pyspark.sql.functions as f
>>> df.withColumn('diffs_col1', f.lag(df.col1, -1).over(winSpec) - df.col1).show()
+-----+----------+-----------+
|index|      col1| diffs_col1|
+-----+----------+-----------+
|  0.0|0.58734024|0.085703015|
|  1.0|0.67304325| 0.17850411|
|  2.0|0.85154736|-0.30657548|
|  3.0| 0.5449719|       null|
+-----+----------+-----------+
Run Code Online (Sandbox Code Playgroud)

问题:我在一个分区中明确地划分了数据帧.这会对性能产生什么影响,如果存在,为什么会这样,我怎么能避免它呢?因为当我没有指定分区时,我收到以下警告:

16/12/24 13:52:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Run Code Online (Sandbox Code Playgroud)

partitioning window-functions apache-spark apache-spark-sql pyspark

15
推荐指数
1
解决办法
9714
查看次数