PySpark: pad an Array[Int] column with zeros

Question

I have the following column of type Array[Int] in a PySpark DataFrame.

+--------------------+
|     feature_indices|
+--------------------+
|                 [0]|
|[0, 1, 4, 10, 11,...|
|           [0, 1, 2]|
|                 [1]|
|                 [0]|
+--------------------+


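I am trying to pad each array with zeros and then cap its length, so that the array in every row has the same length. For example, for n = 5, I expect: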

+--------------------+
|     feature_indices|
+--------------------+
|     [0, 0, 0, 0, 0]|
|   [0, 1, 4, 10, 11]|
|     [0, 1, 2, 0, 0]|
|     [1, 0, 0, 0, 0]|
|     [0, 0, 0, 0, 0]|
+--------------------+


Any suggestions? I looked at PySpark's rpad function, but it only operates on string columns.

Answer 1

Psi*_*dom 5

You can write a UDF to do this:

from pyspark.sql.types import ArrayType, IntegerType
import pyspark.sql.functions as F

n = 5  # target array length

# Keep at most the first n elements, then right-pad with zeros up to length n.
pad_fix_length = F.udf(
    lambda arr: arr[:n] + [0] * (n - len(arr[:n])),
    ArrayType(IntegerType())
)

df.withColumn('feature_indices', pad_fix_length(df.feature_indices)).show()
+-----------------+
|  feature_indices|
+-----------------+
|  [0, 0, 0, 0, 0]|
|[0, 1, 4, 10, 11]|
|  [0, 1, 2, 0, 0]|
|  [1, 0, 0, 0, 0]|
|  [0, 0, 0, 0, 0]|
+-----------------+
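
The truncate-then-pad logic inside the UDF can be checked without a Spark session. Below is a minimal sketch in plain Python. The helper name pad_to_length, the fill parameter, and the handling of a null array (treated as empty) are assumptions added for illustration and are not part of the answer above. The seven-element sample row is also made up, because the long row in the question is truncated.

```python
from typing import List, Optional


def pad_to_length(arr: Optional[List[int]], n: int = 5, fill: int = 0) -> List[int]:
    """Keep at most the first n items of arr, then right-pad with fill to length n.

    Assumes n >= 0. A null array (None) is treated as empty; this is a choice
    made for this sketch, not behaviour of the UDF in the answer.
    """
    head = (arr or [])[:n]                  # truncate; None becomes []
    return head + [fill] * (n - len(head))  # pad only when shorter than n


if __name__ == "__main__":
    # Hypothetical sample rows, not the exact data from the question.
    for row in ([0], [0, 1, 2], [3, 1, 4, 1, 5, 9, 2], None):
        print(row, "->", pad_to_length(row))

# To use it in Spark, the same function could be wrapped as a UDF, e.g.:
#   F.udf(lambda arr: pad_to_length(arr, 5), ArrayType(IntegerType()))
```

Keeping the logic in a named function makes it easy to unit test and to change n in one place. On Spark 2.4 or later, combining the built-in functions slice, concat, and array_repeat may avoid a Python UDF altogether, though that route is not shown in the answer.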

