Bil*_*qat 5 arrays struct user-defined-functions dataframe pyspark
from collections import namedtuple
from pyspark.sql import functions as F
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import ArrayType

d = [{'ID': '1', 'pID': 1000, 'startTime':'2018.07.02T03:34:20', 'endTime':'2018.07.03T02:40:20'},
     {'ID': '1', 'pID': 1000, 'startTime':'2018.07.02T03:45:20', 'endTime':'2018.07.03T02:50:20'},
     {'ID': '2', 'pID': 2000, 'startTime':'2018.07.02T03:34:20', 'endTime':'2018.07.03T02:40:20'},
     {'ID': '2', 'pID': 2000, 'startTime':'2018.07.02T03:45:20', 'endTime':'2018.07.03T02:50:20'}]
df = spark.createDataFrame(d)
Dates = namedtuple("Dates", "startTime endTime")
def MergeAdjacentUsage(timeSets):
    DatesArray = []
    for times in timeSets:
        DatesArray.append(Dates(startTime=times.startTime, endTime=times.endTime))
    return DatesArray
# The next line is where the TypeError is raised.
MergeAdjacentUsages = udf(MergeAdjacentUsage,ArrayType(Dates()))
df1=df.groupBy(['ID','pID']).agg(MergeAdjacentUsages(F.collect_list(struct('startTime','endTime'))).alias("Times"))
display(df1)
All I want is to set the column value to the array of structs returned by the UDF. The error I get is:
TypeError: __new__() takes exactly 3 arguments (1 given)
TypeError                                 Traceback (most recent call last)
in ()
     22     return DatesArray
     23
---> 24 MergeAdjacentUsages = udf(MergeAdjacentUsage,ArrayType(Dates()))
     25
     26 df1=df.groupBy(['ID','pID']).agg(MergeAdjacentUsages(F.collect_list(struct('startTime','endTime'))).alias("Times"))
Any help, ideas, or hints would be appreciated.
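pyspark does not allow user-defined class objects as DataFrame column types. Instead, we need to create a StructType, which plays the same role for a column that a class or namedtuple plays in Python.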
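The TypeError in the question is raised before Spark does any work: `Dates()` tries to instantiate the namedtuple without its two field values. A minimal sketch that reproduces this in plain Python, with no Spark session assumed (the helper name `try_build_type` is hypothetical, and the exact message wording varies between Python versions):

```python
from collections import namedtuple

Dates = namedtuple("Dates", "startTime endTime")

def try_build_type():
    """Return the message of the error raised by calling Dates() with no arguments."""
    try:
        Dates()  # the same call that appears inside ArrayType(Dates())
    except TypeError as exc:
        return str(exc)
    return None

message = try_build_type()
print(message)

# Supplying both fields works, but the result is a value, not a Spark data type.
sample = Dates("2018.07.02T03:34:20", "2018.07.03T02:40:20")
print(sample.startTime)
```

Even a correctly constructed `Dates` instance is only a Python value, so the UDF's return type still has to be described with Spark's own type classes.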
For example:
from pyspark.sql.types import *
from pyspark.sql.functions import udf
from pyspark.sql import functions as F
# from pyspark.sql.functions import *

d = [{'ID': '1', 'pID': 1000, 'startTime': '2018.07.02T03:34:20', 'endTime': '2018.07.03T02:40:20'},
     {'ID': '1', 'pID': 1000, 'startTime': '2018.07.02T03:45:20', 'endTime': '2018.07.03T02:50:20'},
     {'ID': '2', 'pID': 2000, 'startTime': '2018.07.02T03:34:20', 'endTime': '2018.07.03T02:40:20'},
     {'ID': '2', 'pID': 2000, 'startTime': '2018.07.02T03:45:20', 'endTime': '2018.07.03T02:50:20'}]
df = spark.createDataFrame(d)

# Dates = namedtuple("Dates", "startTime endTime")
schema = ArrayType(StructType([
    StructField("startTime", StringType(), False),
    StructField("endTime", StringType(), False)
]))

MergeAdjacentUsages = udf(lambda xs: xs, schema)

df1 = df.groupBy(['ID', 'pID']).agg(MergeAdjacentUsages(
    F.collect_list(F.struct('startTime', 'endTime'))).alias("Times"))
df1.show(truncate=False)
+---+----+----------------------------------------------------------------------------------------+
|ID |pID |Times |
+---+----+----------------------------------------------------------------------------------------+
|2 |2000|[[2018.07.02T03:34:20, 2018.07.03T02:40:20], [2018.07.02T03:45:20, 2018.07.03T02:50:20]]|
|1 |1000|[[2018.07.02T03:34:20, 2018.07.03T02:40:20], [2018.07.02T03:45:20, 2018.07.03T02:50:20]]|
+---+----+----------------------------------------------------------------------------------------+
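The UDF above passes the collected list through unchanged. If the goal is to actually merge overlapping usage windows, as the name MergeAdjacentUsage suggests, the function body could look like the sketch below. It rests on several assumptions: each element exposes `startTime` and `endTime` attributes (as Spark Row objects do), the timestamps are fixed-width strings so that string comparison matches chronological order, and "adjacent" means overlapping or touching. The names `merge_adjacent_usage` and `Usage` are hypothetical; `Usage` only stands in for Spark rows so the logic can be tried without a cluster.

```python
from collections import namedtuple

# Stand-in for the Row objects Spark would pass to the UDF (assumption for local testing).
Usage = namedtuple("Usage", "startTime endTime")

def merge_adjacent_usage(time_sets):
    """Merge overlapping or touching (startTime, endTime) windows into plain tuples."""
    merged = []
    for t in sorted(time_sets, key=lambda t: t.startTime):
        if merged and t.startTime <= merged[-1][1]:
            # Starts before the previous window ends: extend that window if needed.
            start, end = merged[-1]
            merged[-1] = (start, max(end, t.endTime))
        else:
            # No overlap with the previous window: start a new one.
            merged.append((t.startTime, t.endTime))
    return merged

windows = [
    Usage("2018.07.02T03:45:20", "2018.07.03T02:50:20"),
    Usage("2018.07.02T03:34:20", "2018.07.03T02:40:20"),
]
result = merge_adjacent_usage(windows)
print(result)
```

With Spark available, this function could replace the lambda, as in `udf(merge_adjacent_usage, schema)`, since Spark maps returned tuples positionally onto the fields of the StructType.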
Hope this helps!