My question is similar to this thread: Partitioning by multiple columns in Spark SQL.
I am working in PySpark rather than Scala, though, and I want to pass in my list of columns as a Python list. I want to do something like this:
column_list = ["col1","col2"]
win_spec = Window.partitionBy(column_list)
I can get the following to work:
win_spec = Window.partitionBy(col("col1"))
This also works:
col_name = "col1"
win_spec = Window.partitionBy(col(col_name))
And this works as well:
win_spec = Window.partitionBy([col("col1"), col("col2")])
Use a list comprehension, [col(x) for x in column_list], to convert the column names into column expressions:
from pyspark.sql.functions import col
column_list = ["col1","col2"]
win_spec = Window.partitionBy([col(x) for x in column_list])
For PySpark >= 2.4, this also works:
column_list = ["col1","col2"]
win_spec = Window.partitionBy(*column_list)
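The `*` works because Python unpacks the list into separate positional arguments, so `partitionBy(*column_list)` is the same call as `partitionBy("col1", "col2")`. A minimal pure-Python sketch of that unpacking semantics (the `partition_by` function here is a hypothetical stand-in, not the Spark API):

```python
# Hypothetical stand-in for Window.partitionBy: it just records its positional args.
def partition_by(*cols):
    return list(cols)

column_list = ["col1", "col2"]

# Unpacking the list is equivalent to passing the column names individually.
assert partition_by(*column_list) == partition_by("col1", "col2")
print(partition_by(*column_list))  # ['col1', 'col2']
```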
Your first attempt should work. Consider the following example:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("a", "apple", 1),
        ("a", "orange", 2),
        ("a", "orange", 3),
        ("b", "orange", 3),
        ("b", "orange", 5)
    ],
    ["name", "fruit", "value"]
)
df.show()
#+----+------+-----+
#|name| fruit|value|
#+----+------+-----+
#| a| apple| 1|
#| a|orange| 2|
#| a|orange| 3|
#| b|orange| 3|
#| b|orange| 5|
#+----+------+-----+
Suppose you wanted to compute, for each row, its fraction of the sum of value, grouped by the first two columns:
cols = ["name", "fruit"]
w = Window.partitionBy(cols)
df.select(cols + [(f.col('value') / f.sum('value').over(w)).alias('fraction')]).show()
#+----+------+--------+
#|name| fruit|fraction|
#+----+------+--------+
#| a| apple| 1.0|
#| b|orange| 0.375|
#| b|orange| 0.625|
#| a|orange| 0.6|
#| a|orange| 0.4|
#+----+------+--------+
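As a sanity check, those fractions can be reproduced with plain Python group sums over the same five rows (a sketch independent of Spark):

```python
from collections import defaultdict

rows = [
    ("a", "apple", 1),
    ("a", "orange", 2),
    ("a", "orange", 3),
    ("b", "orange", 3),
    ("b", "orange", 5),
]

# Sum value per (name, fruit) group, mirroring Window.partitionBy(["name", "fruit"]).
totals = defaultdict(int)
for name, fruit, value in rows:
    totals[(name, fruit)] += value

# Each row's fraction is its value divided by its group's total.
fractions = [(name, fruit, value / totals[(name, fruit)])
             for name, fruit, value in rows]
print(fractions)
# ("a", "apple") sums to 1, ("a", "orange") to 5, and ("b", "orange") to 8,
# giving the 1.0, 0.4/0.6, and 0.375/0.625 fractions seen in the Spark output.
```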