Remove a substring from array elements and remove duplicates in pyspark

Question

ver*_*cla 2 apache-spark pyspark pyspark-sql pyspark-dataframes

I have a pyspark dataframe:

number  |  matricule      
--------------------------------------------
1       |  ["AZ 1234", "1234", "00100"]                   
--------------------------------------------
23      |  ["1010", "12987"]                   
--------------------------------------------
56      |  ["AZ 98989", "22222", "98989"]                   
--------------------------------------------

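In the matricule array, removing the AZ string leaves me with duplicate values. I want to remove the "AZ" string and then remove the duplicate values in the matricule array. Note that sometimes there is a space after AZ, which should be removed as well.

I wrote a udf: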

def remove_AZ(A)
    for item in A:
        if item.startswith('AZ'):
            item.replace('AZ','')
udf_remove_AZ = F.udf(remove_AZ)
df = df.withColumn("AZ_2", udf_remove_AZ(df.matricule))

I get null in every row of the AZ_2 column.

How can I remove AZ from each value in the matricule array and then remove the duplicates inside it? Thanks

Answer 1

bla*_*hop 6

For Spark 2.4+, you can use the transform + array_distinct functions like this:

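from pyspark.sql.functions import array_distinct, expr
# Strip a leading "AZ" from each element, then trim surrounding whitespace
t = "transform(matricule, x -> trim(regexp_replace(x, '^AZ', '')))"
df.withColumn("matricule", array_distinct(expr(t))).show(truncate=False)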

#+------+--------------+
#|number|matricule     |
#+------+--------------+
#|1     |[1234, 00100] |
#|23    |[1010, 12987] |
#|56    |[98989, 22222]|
#+------+--------------+

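For each element of the array, transform applies regexp_replace to remove the AZ characters from the beginning of the string and trim to strip leading and trailing spaces, if any. array_distinct then drops the duplicates that result.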
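
As for why the UDF in the question yields null: the function never returns anything, so Python returns None for every row. In addition, str.replace returns a new string rather than modifying item in place, and F.udf defaults to a string return type when none is given. The sketch below is one possible corrected version, not the asker's or answerer's code. The name remove_az, the null handling, and the order-preserving dedup are assumptions made for illustration, and the pyspark import is guarded so the plain function can be tried without Spark installed.

```python
import re

def remove_az(items):
    """Strip a leading 'AZ' from each element, trim it, and drop duplicates."""
    if items is None:
        return None  # keep null arrays null
    cleaned = [
        None if item is None else re.sub(r"^AZ", "", item).strip()
        for item in items
    ]
    # dict.fromkeys keeps the first occurrence of each value, in order
    return list(dict.fromkeys(cleaned))

try:
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType
    # Declare the return type explicitly; the default would be a string
    udf_remove_az = F.udf(remove_az, ArrayType(StringType()))
except ImportError:
    udf_remove_az = None  # pyspark not installed; remove_az still works alone
```

It would be applied as in the question, with df.withColumn("AZ_2", udf_remove_az(df.matricule)). Where Spark 2.4+ is available, the built-in transform + array_distinct approach is usually preferable, since a Python UDF adds serialization overhead between the JVM and Python.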
