I have this dataframe:
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])
+-----+---------+
|store| values|
+-----+---------+
| 1|[1, 2, 3]|
| 1|[4, 5, 6]|
| 2| [2]|
| 2| [3]|
+-----+---------+
I want to transform it into the following df:
+-----+------------------+
|store| values |
+-----+------------------+
| 1|[1, 2, 3, 4, 5, 6]|
| 2| [2, 3]|
+-----+------------------+
I did this:
from pyspark.sql import functions as F
df.groupBy("store").agg(F.collect_list("values")).show(truncate=False)
But the result contains these WrappedArrays:
+-----+----------------------------------------------+
|store|collect_list(values) |
+-----+----------------------------------------------+
|1 |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6)]|
|2 |[WrappedArray(2), WrappedArray(3)] |
+-----+----------------------------------------------+
Is there a way to convert these WrappedArrays into a single flattened array? Or can I approach this differently? …
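
For reference, here is a minimal sketch of two possible approaches I am considering: F.flatten applied to the collect_list result (assuming Spark 2.4+ where flatten is available), or exploding the arrays first and collecting the individual elements (which should also work on older versions):

from pyspark.sql import functions as F

# Option 1 (Spark >= 2.4): flatten the array-of-arrays produced by collect_list
flat_df = (
    df.groupBy("store")
      .agg(F.flatten(F.collect_list("values")).alias("values"))
)
flat_df.show(truncate=False)

# Option 2 (older Spark): explode each array into rows, then collect the scalars
exploded_df = (
    df.withColumn("value", F.explode("values"))
      .groupBy("store")
      .agg(F.collect_list("value").alias("values"))
)
exploded_df.show(truncate=False)

Note that collect_list does not guarantee the order of the collected elements, so the combined array may not come back sorted as in the desired output above.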