python apache-spark apache-spark-sql pyspark
I want to find the mode of the task column in this DataFrame:
+-----+-----------------------------------------+
| id | task |
+-----+-----------------------------------------+
| 101 | [person1, person1, person3] |
| 102 | [person1, person2, person3] |
| 103 | null |
| 104 | [person1, person2] |
| 105 | [person1, person1, person2, person2] |
| 106 | null |
+-----+-----------------------------------------+
If there is more than one mode, I want to show all of them.
Can someone help me get this output:
+-----+-----------------------------------------+---------------------------+
| id | task | mode |
+-----+-----------------------------------------+---------------------------+
| 101 | [person1, person1, person3] |[person1] |
| 102 | [person1, person2, person3] |[person1, person2, person3]|
| 103 | null |[] |
| 104 | [person1, person2] |[person1, person2] |
| 105 | [person1, person1, person2, person2] |[person1, person2] |
| 106 | null |[] |
+-----+-----------------------------------------+---------------------------+
This is my first question here. Any help or hints are greatly appreciated. Thank you.
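Using Spark 2.3:
You can solve this with a custom UDF. To get multiple mode values, I used Counter. The except block in the UDF handles the null cases in the task column.
(Python 3.8+ users can use the standard-library function statistics.multimode() instead.)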
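For reference, here is a minimal sketch of the statistics.multimode() variant. The function name get_multi_mode_list_py38 is hypothetical, and the sketch assumes the Python workers on the cluster run Python 3.8 or later. It is plain Python, so it can be tried without a Spark session and then registered with F.udf like any other Python function.

```python
from statistics import multimode


def get_multi_mode_list_py38(input_array):
    """Return every most-frequent element of input_array (hypothetical helper)."""
    # A null array value reaches a Python UDF as None; treat it as "no mode".
    if input_array is None:
        return []
    # multimode returns all values tied for the highest count, in the order
    # they were first encountered, and an empty list for empty input.
    return multimode(input_array)


print(get_multi_mode_list_py38(["person1", "person1", "person3"]))             # ['person1']
print(get_multi_mode_list_py38(["person1", "person1", "person2", "person2"]))  # ['person1', 'person2']
print(get_multi_mode_list_py38(None))                                          # []
```

Unlike a set-based result, multimode keeps first-seen order, so the modes come back in the same order as they appear in the task array.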
Your DataFrame:
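# Explicit imports avoid shadowing Python built-ins such as max and sum.
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType
from pyspark.sql import functions as F
from pyspark.sql.functions import col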
schema = StructType([StructField("id", IntegerType()), StructField("task", ArrayType(StringType()))])
data = [[101, ["person1", "person1", "person3"]], [102, ["person1", "person2", "person3"]], [103, None], [104, ["person1", "person2"]], [105, ["person1", "person1", "person2", "person2"]], [106, None]]
df = spark.createDataFrame(data,schema=schema)
Operation:
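from collections import Counter

def get_multi_mode_list(input_array):
    # Counter(None) is an empty Counter, so null task arrays are handled too.
    counter_var = Counter(input_array)
    try:
        # Highest frequency of any element in the array.
        max_count = counter_var.most_common(1)[0][1]
    except IndexError:
        # Empty Counter (null or empty array): there is no mode.
        return []
    # Keep every element whose count ties the highest frequency.
    # The order of the returned modes is not significant.
    return [item for item, count in counter_var.items() if count == max_count]

get_multi_mode_list_udf = F.udf(get_multi_mode_list, ArrayType(StringType()))
df.withColumn("multi_mode", get_multi_mode_list_udf(col("task"))).show(truncate=False)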
Output (the order of the elements inside multi_mode may differ from what is shown here):
+---+------------------------------------+---------------------------+
|id |task |multi_mode |
+---+------------------------------------+---------------------------+
|101|[person1, person1, person3] |[person1] |
|102|[person1, person2, person3] |[person2, person3, person1]|
|103|null |[] |
|104|[person1, person2] |[person2, person1] |
|105|[person1, person1, person2, person2]|[person2, person1] |
|106|null |[] |
+---+------------------------------------+---------------------------+
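If you want to sanity-check the mode logic before wiring it into a UDF, something like the following sketch runs without a Spark session. The helper name find_modes and the rows dictionary are hypothetical; the data is copied from the question. It also shows why most_common(1) alone is not enough: it returns a single entry even when several elements tie for the highest count.

```python
from collections import Counter


def find_modes(values):
    """Return all elements tied for the highest frequency (hypothetical helper)."""
    # None (a null array) and empty lists have no mode.
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    return [item for item, count in counts.items() if count == top]


# Rows copied from the question: id -> task array (None stands for null).
rows = {
    101: ["person1", "person1", "person3"],
    102: ["person1", "person2", "person3"],
    103: None,
    104: ["person1", "person2"],
    105: ["person1", "person1", "person2", "person2"],
    106: None,
}

for row_id, task in rows.items():
    print(row_id, find_modes(task))

# most_common(1) reports only one entry, even though row 105 has two modes.
tied = Counter(rows[105])
print(len(tied.most_common(1)), len(find_modes(rows[105])))  # 1 2
```

Comparing results as sets is the safer check here, since the order of the modes is not part of the requirement.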