Since Spark 1.5.0 it appears to be possible to write your own UDAF for custom aggregations on DataFrames: Spark 1.5 DataFrame API Highlights: Date/Time/String Handling, Time Intervals, and UDAFs

However, it is not clear to me whether this functionality is supported in the Python API?
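(For contrast, what the Python API does expose at that point are plain row-wise UDFs via pyspark.sql.functions.udf; here is a minimal sketch, where the knots-to-km/h conversion is just an illustration, not code from the question:)

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# A row-wise UDF transforms one value per row; unlike a UDAF it cannot
# aggregate over a group.
knots_to_kmh = udf(lambda sog: sog * 1.852 if sog is not None else None,
                   FloatType())
# Usage: df.withColumn("SOG_kmh", knots_to_kmh(df["SOG"]))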
I am trying to merge two columns of different data types. In the snippet below, for simplicity, I select both columns from the same DataFrame.
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *
from datetime import datetime

# Sample data: (name, time, ROT, SOG, COG); note the null COG in the last row.
a = sc.parallelize([('ship1', datetime(2015, 1, 1), 2, 3., 4.),
                    ('ship1', datetime(2015, 1, 2), 4, 8., 9.),
                    ('ship1', datetime(2015, 1, 3), 5, 39., 49.),
                    ('ship2', datetime(2015, 1, 4), 2, 3., 4.),
                    ('ship2', datetime(2015, 1, 5), 4, 4., 6.),
                    ('ship3', datetime(2015, 1, 15), 33, 56., 6.),
                    ('ship3', datetime(2015, 1, 12), 3, 566., 64.),
                    ('ship4', datetime(2015, 1, 5), 3, 3., None)])

# Pair each column name with its type to build the schema.
schemaString = "name time ROT SOG COG"
strtype = [StringType(), TimestampType(), IntegerType(), FloatType(), FloatType()]
fields = [StructField(name, dtype, True) for name, dtype in zip(schemaString.split(), strtype)]
schema = StructType(fields)
df = sqlContext.createDataFrame(a, schema)
df.show()
+-----+--------------------+---+-----+----+
| name| time|ROT| SOG| COG|
+-----+--------------------+---+-----+----+
|ship1|2015-01-01 00:00:...| 2| 3.0| 4.0|
|ship1|2015-01-02 00:00:...| 4| 8.0| 9.0|
|ship1|2015-01-03 00:00:...| 5| 39.0|49.0|
|ship2|2015-01-04 00:00:...| 2| 3.0| 4.0|
|ship2|2015-01-05 00:00:...| 4| 4.0| 6.0|
|ship3|2015-01-15 00:00:...| 33| 56.0| 6.0|
|ship3|2015-01-12 00:00:...| 3|566.0|64.0|
|ship4|2015-01-05 00:00:...| 3| 3.0|null|
+-----+--------------------+---+-----+----+
When I pull two columns out of df into new DataFrames and then try to merge them back with df.withColumn():
b = df.select("time") …
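For context, a minimal sketch of the pattern described above, under the assumption that the elided code selects a second column and feeds it to withColumn (the names b, c, and the join-key workaround are my reconstruction, not the original snippet):

b = df.select("time")
c = df.select("SOG")

# withColumn can only resolve columns of the DataFrame it is called on, so
# mixing in a column from a different DataFrame typically fails:
# b.withColumn("SOG", c["SOG"])   # -> AnalysisException

# A common workaround is to give both sides an explicit join key and join.
# (In Spark 1.5 the function is spelled monotonicallyIncreasingId; newer
# releases use monotonically_increasing_id. Row alignment here relies on both
# frames deriving from the same parent with identical partitioning.)
from pyspark.sql.functions import monotonicallyIncreasingId
b2 = b.withColumn("row_id", monotonicallyIncreasingId())
c2 = c.withColumn("row_id", monotonicallyIncreasingId())
merged = b2.join(c2, "row_id").drop("row_id")

The join-key detour is needed because Spark DataFrames have no notion of positional "zip" across two frames; rows can only be combined through a join condition.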