PySpark: sorting an array of structs

Wan*_*nda 5 python apache-spark pyspark

Here is a dummy sample of my DataFrame:

data = [
    [3273, "city y", [["ids", 27], ["smf", 13], ["tlk", 35], ["thr", 24]]],
    [3213, "city x", [["smf", 23], ["tlk", 15], ["ids", 17], ["thr", 34]]],
]
df = spark.createDataFrame(
    data, "city_id:long, city_name:string, cel:array<struct<carr:string, subs:int>>"
)
df.show(2, False)

+-------+---------+--------------------------------------------+
|city_id|city_name|cel                                         |
+-------+---------+--------------------------------------------+
|3273   |city y   |[[ids, 27], [smf, 13], [tlk, 35], [thr, 24]]|
|3213   |city x   |[[smf, 23], [tlk, 15], [ids, 17], [thr, 34]]|
+-------+---------+--------------------------------------------+

I need to sort the array in column cel in descending order by its subs value. The result should look like this:

+-------+---------+--------------------------------------------+
|city_id|city_name|cel                                         |
+-------+---------+--------------------------------------------+
|3273   |city y   |[[tlk, 35], [ids, 27], [thr, 24], [smf, 13]]|
|3213   |city x   |[[thr, 34], [smf, 23], [ids, 17], [tlk, 15]]|
+-------+---------+--------------------------------------------+

If possible, is there a way to do this without using a UDF? Thanks.

I am using Spark version 2.4.0.

Ste*_*ven 6

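You can do this with Spark SQL higher-order (lambda) functions. Note that the struct fields are reordered so that subs comes first, because structs are compared field by field in declaration order: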

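from pyspark.sql import functions as F

# Rebuild each struct with subs first, sort ascending with array_sort,
# then reverse the array to get descending order by subs.
df = df.withColumn(
    "cel",
    F.expr(
        "reverse(array_sort(transform(cel, x -> struct(x['subs'] as subs, x['carr'] as carr))))"
    ),
)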

df.show()
+-------+---------+--------------------------------------------+
|city_id|city_name|cel                                         |
+-------+---------+--------------------------------------------+
|3273   |city y   |[[35, tlk], [27, ids], [24, thr], [13, smf]]|
|3213   |city x   |[[34, thr], [23, smf], [17, ids], [15, tlk]]|
+-------+---------+--------------------------------------------+

df.printSchema()
root
 |-- city_id: long (nullable = true)
 |-- city_name: string (nullable = true)
 |-- cel: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- subs: integer (nullable = true)
 |    |    |-- carr: string (nullable = true)