How to find multiple modes of an array column in PySpark

python apache-spark apache-spark-sql pyspark

I want to find the mode of the task column in this dataframe:

+-----+-----------------------------------------+
|  id |              task                       |
+-----+-----------------------------------------+
| 101 |   [person1, person1, person3]           |
| 102 |   [person1, person2, person3]           |
| 103 |           null                          |
| 104 |   [person1, person2]                    |
| 105 |   [person1, person1, person2, person2]  |
| 106 |           null                          |
+-----+-----------------------------------------+

If there are multiple modes, I want to show all of them.
Can someone help me get this output:

+-----+-----------------------------------------+---------------------------+
|  id |              task                       |           mode            |
+-----+-----------------------------------------+---------------------------+
| 101 |   [person1, person1, person3]           |[person1]                  |
| 102 |   [person1, person2, person3]           |[person1, person2, person3]|
| 103 |           null                          |[]                         |
| 104 |   [person1, person2]                    |[person1, person2]         |
| 105 |   [person1, person1, person2, person2]  |[person1, person2]         |
| 106 |           null                          |[]                         |
+-----+-----------------------------------------+---------------------------+

This is my first question here. Any help or hints are much appreciated. Thank you.

Sur*_*ali 0

Using Spark 2.3:

You can solve this with a custom UDF. To collect multiple mode values I used collections.Counter, and the early return at the top of the UDF covers the rows where task is null.
(If you are on Python 3.8+, you can use the built-in statistics.multimode() function instead.)
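
For reference, here is a rough sketch of that Python 3.8+ variant (the get_multimode name is just for illustration; the null guard is still needed, because Spark passes None to the UDF for null rows):

from statistics import multimode

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def get_multimode(input_array):
    # multimode() returns every value with the highest count, in first-seen order
    return multimode(input_array) if input_array else []

get_multimode_udf = F.udf(get_multimode, ArrayType(StringType()))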

Your dataframe:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("task", ArrayType(StringType())),
])
data = [
    [101, ["person1", "person1", "person3"]],
    [102, ["person1", "person2", "person3"]],
    [103, None],
    [104, ["person1", "person2"]],
    [105, ["person1", "person1", "person2", "person2"]],
    [106, None],
]

df = spark.createDataFrame(data, schema=schema)

Operation:

from collections import Counter

def get_multi_mode_list(input_array):
    # Spark passes None for null task values, so return an empty list early
    if not input_array:
        return []
    counts = Counter(input_array)
    # Frequency of the most common element
    max_count = counts.most_common(1)[0][1]
    # Keep every element that occurs max_count times
    return [item for item, count in counts.items() if count == max_count]


get_multi_mode_list_udf = F.udf(get_multi_mode_list, ArrayType(StringType()))

df.withColumn("multi_mode", get_multi_mode_list_udf(F.col("task"))).show(truncate=False)

Output:

+---+------------------------------------+---------------------------+
|id |task                                |multi_mode                 |
+---+------------------------------------+---------------------------+
|101|[person1, person1, person3]         |[person1]                  |
|102|[person1, person2, person3]         |[person2, person3, person1]|
|103|null                                |[]                         |
|104|[person1, person2]                  |[person2, person1]         |
|105|[person1, person1, person2, person2]|[person2, person1]         |
|106|null                                |[]                         |
+---+------------------------------------+---------------------------+
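
For larger data, the same result can be computed without a Python UDF, which usually scales better. This is a minimal sketch using explode_outer plus a window on Spark 2.3+ (the exploded/counts/modes names are just for illustration, and collect_list gives no ordering guarantee for the mode array):

from pyspark.sql import Window

w = Window.partitionBy("id")

exploded = df.select("id", "task", F.explode_outer("task").alias("item"))

# count("item") ignores nulls, so rows coming from a null task get cnt = 0
counts = exploded.groupBy("id", "task", "item").agg(F.count("item").alias("cnt"))

modes = (
    counts
    .withColumn("max_cnt", F.max("cnt").over(w))
    .where(F.col("cnt") == F.col("max_cnt"))
    .groupBy("id", "task")
    # collect_list skips nulls, so the null-task rows end up with an empty array
    .agg(F.collect_list("item").alias("mode"))
)

modes.orderBy("id").show(truncate=False)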