Sai*_*kat 1 apache-spark apache-spark-sql pyspark pyspark-dataframes
我有一个数据框,如下所示:
+-----+------------------------+
|Index| finalArray |
+-----+------------------------+
|1 |[0, 2, 0, 3, 1, 4, 2, 7]|
|2 |[0, 4, 4, 3, 4, 2, 2, 5]|
+-----+------------------------+
Run Code Online (Sandbox Code Playgroud)
我想将数组分成 2 个块,然后找到每个块的总和并将结果数组存储在列 finalArray 中。它将如下所示:
+-----+---------------------+
|Index| finalArray |
+-----+---------------------+
|1 |[2, 3, 5, 9] |
|2 |[4, 7, 6, 7] |
+-----+---------------------+
Run Code Online (Sandbox Code Playgroud)
我可以通过创建 UDF 但寻找更好和优化的方法来做到这一点。如果我可以使用 withColumn 并传递 flagArray 来处理它,而不必编写 UDF,则最好。
@udf(ArrayType(DoubleType()))
def aggregate(finalArray,chunkSize):
n = int(chunkSize)
aggsum = []
final = [finalArray[i * n:(i + 1) * n] for i in range((len(finalArray) + n - 1) // n )]
for item in final:
agg = 0
for j in item:
agg += j
aggsum.append(agg)
return aggsum
Run Code Online (Sandbox Code Playgroud)
我无法在 UDF 中使用以下表达式,因此我使用了循环
[sum(finalArray[x:x+2]) for x in range(0, len(finalArray), chunkSize)]
Run Code Online (Sandbox Code Playgroud)
from pyspark.sql.function import expr
df = spark.createDataFrame([
(1, [0, 2, 0, 3, 1, 4, 2, 7]),
(2, [0, 4, 4, 3, 4, 2, 2, 5])
], ["Index", "finalArray"])
df.withColumn("finalArray", expr("""
transform(
sequence(0,ceil(size(finalArray)/2)-1),
i -> finalArray[2*i] + ifnull(finalArray[2*i+1],0))
""")).show(truncate=False)
+-----+------------+
|Index|finalArray |
+-----+------------+
|1 |[2, 3, 5, 9]|
|2 |[4, 7, 6, 7]|
+-----+------------+
Run Code Online (Sandbox Code Playgroud)
对于任意 N 的块大小,使用聚合函数进行小计:
N = 3
sql_expr = """
transform(
/* create a sequence from 0 to number_of_chunks-1 */
sequence(0,ceil(size(finalArray)/{0})-1),
/* iterate the above sequence */
i ->
/* create a sequence from 0 to chunk_size-1
calculate the sum of values containing every chunk_size items by their indices
*/
aggregate(
sequence(0,{0}-1),
0L,
(acc, y) -> acc + ifnull(finalArray[i*{0}+y],0)
)
)
"""
df.withColumn("finalArray", expr(sql_expr.format(N))).show()
+-----+----------+
|Index|finalArray|
+-----+----------+
| 1| [2, 8, 9]|
| 2| [8, 9, 7]|
+-----+----------+
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
542 次 |
| 最近记录: |