All*_*ati 5 window sql-order-by pyspark
我通过一个例子来解释我的问题:
让我们假设我们有一个如下的数据框:
original_df = sc.createDataFrame([('x', 10,), ('x', 15,), ('x', 10,), ('x', 25,), ('y', 20,), ('y', 10,), ('y', 20,)], ["key", "price"] )
original_df.show()
Run Code Online (Sandbox Code Playgroud)
输出:
+---+-----+
|key|price|
+---+-----+
| x| 10|
| x| 15|
| x| 10|
| x| 25|
| y| 20|
| y| 10|
| y| 20|
+---+-----+
Run Code Online (Sandbox Code Playgroud)
并假设我想获取prices每个key使用的列表window:
w = Window.partitionBy('key')
original_df.withColumn('price_list', F.collect_list('price').over(w)).show()
Run Code Online (Sandbox Code Playgroud)
输出:
+---+-----+----------------+
|key|price| price_list|
+---+-----+----------------+
| x| 10|[10, 15, 10, 25]|
| x| 15|[10, 15, 10, 25]|
| x| 10|[10, 15, 10, 25]|
| x| 25|[10, 15, 10, 25]|
| y| 20| [20, 10, 20]|
| y| 10| [20, 10, 20]|
| y| 20| [20, 10, 20]|
+---+-----+----------------+
Run Code Online (Sandbox Code Playgroud)
到现在为止还挺好。
但是,如果我想获得一个有序列表,并将其添加orderBy到我的窗口中,w我会得到:
w = Window.partitionBy('key').orderBy('price')
original_df.withColumn('ordered_list', F.collect_list('price').over(w)).show()
Run Code Online (Sandbox Code Playgroud)
输出:
+---+-----+----------------+
|key|price| ordered_list|
+---+-----+----------------+
| x| 10| [10, 10]|
| x| 10| [10, 10]|
| x| 15| [10, 10, 15]|
| x| 25|[10, 10, 15, 25]|
| y| 10| [10]|
| y| 20| [10, 20, 20]|
| y| 20| [10, 20, 20]|
+---+-----+----------------+
Run Code Online (Sandbox Code Playgroud)
这意味着orderBy(某种)也更改rowsBetween了窗口中的行(与所做的相同)!这是不应该做的。
尽管我可以通过rowsBetween在窗口中指定来修复它并获得预期的结果,
w = Window.partitionBy('key').orderBy('price').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
Run Code Online (Sandbox Code Playgroud)
有人可以解释为什么orderBy会window这样影响吗?
Man*_*ngh 13
Spark Window 使用三部分指定:分区、顺序和框架。
专门针对您的问题,orderBy 不仅可以对分区数据进行排序,还可以更改行框选择
下面是不同的 windowspec 和相应的输出
Window.orderBy()
+---+-----+----------------------------+
|key|price|price_list |
+---+-----+----------------------------+
|x |15 |[15, 10, 10, 20, 10, 25, 20]|
|x |10 |[15, 10, 10, 20, 10, 25, 20]|
|y |10 |[15, 10, 10, 20, 10, 25, 20]|
|y |20 |[15, 10, 10, 20, 10, 25, 20]|
|x |10 |[15, 10, 10, 20, 10, 25, 20]|
|x |25 |[15, 10, 10, 20, 10, 25, 20]|
|y |20 |[15, 10, 10, 20, 10, 25, 20]|
+---+-----+----------------------------+
Window.partitionBy('key')
+---+-----+----------------+
|key|price| price_list|
+---+-----+----------------+
| x| 15|[15, 10, 10, 25]|
| x| 10|[15, 10, 10, 25]|
| x| 10|[15, 10, 10, 25]|
| x| 25|[15, 10, 10, 25]|
| y| 20| [20, 10, 20]|
| y| 10| [20, 10, 20]|
| y| 20| [20, 10, 20]|
+---+-----+----------------+
Window.partitionBy('key').orderBy('price')
+---+-----+----------------+
|key|price| ordered_list|
+---+-----+----------------+
| x| 10| [10, 10]|
| x| 10| [10, 10]|
| x| 15| [10, 10, 15]|
| x| 25|[10, 10, 15, 25]|
| y| 10| [10]|
| y| 20| [10, 20, 20]|
| y| 20| [10, 20, 20]|
+---+-----+----------------+
w = Window.partitionBy('key').orderBy(F.desc('price'))
+---+-----+----------------+
|key|price| ordered_list|
+---+-----+----------------+
| x| 25| [25]|
| x| 15| [25, 15]|
| x| 10|[25, 15, 10, 10]|
| x| 10|[25, 15, 10, 10]|
| y| 20| [20, 20]|
| y| 20| [20, 20]|
| y| 10| [20, 20, 10]|
+---+-----+----------------+
Window.partitionBy('key').orderBy('price').rowsBetween(Window.unboundedPreceding, Window.currentRow)
+---+-----+----------------+
|key|price| ordered_list|
+---+-----+----------------+
| x| 10| [10]|
| x| 10| [10, 10]|
| x| 15| [10, 10, 15]|
| x| 25|[10, 10, 15, 25]|
| y| 10| [10]|
| y| 20| [10, 20]|
| y| 20| [10, 20, 20]|
+---+-----+----------------+
Window.partitionBy('key').rowsBetween(Window.unboundedPreceding, Window.currentRow)
+---+-----+----------------+
|key|price| ordered_list|
+---+-----+----------------+
| x| 15| [15]|
| x| 10| [15, 10]|
| x| 10| [15, 10, 10]|
| x| 25|[15, 10, 10, 25]|
| y| 10| [10]|
| y| 20| [10, 20]|
| y| 20| [10, 20, 20]|
+---+-----+----------------+
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
4611 次 |
| 最近记录: |