Suppose I have the following DataFrame:
dummy_data = [('a',1),('b',25),('c',3),('d',8),('e',1)]
df = sc.parallelize(dummy_data).toDF(['letter','number'])
I want to produce the following DataFrame:
[('a',0),('b',2),('c',1),('d',3),('e',0)]
What I did was convert the DataFrame to an RDD, apply zipWithIndex, and then join the result back:
convertDF = (df.select('number')
             .distinct()
             .rdd
             .zipWithIndex()
             .map(lambda x: (x[0].number, x[1]))
             .toDF(['old', 'new']))

finalDF = (df
           .join(convertDF, df.number == convertDF.old)
           .select(df.letter, convertDF.new))
Is there a zipWithIndex-like function that works directly on DataFrames? And is there a more efficient way to accomplish this task?
See https://issues.apache.org/jira/browse/SPARK-23074, which tracks adding this functionality directly to DataFrames (feature parity with RDDs).
In the meantime, here is a workaround in PySpark:
from pyspark.sql.types import StructType, StructField, LongType

def dfZipWithIndex(df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(),
    but on a dataframe, and preserves the schema.

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''
    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # newly added field in front
        + df.schema.fields                        # previous schema
    )

    zipped_rdd = df.rdd.zipWithIndex()
    # prepend the (offset-adjusted) index to each original row
    new_rdd = zipped_rdd.map(lambda args: ([args[1] + offset] + list(args[0])))

    return spark.createDataFrame(new_rdd, new_schema)
This is also available in the abalon package.
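For illustration, a minimal usage sketch applying the helper to the question's original task (assuming an existing SparkSession named spark and the imports above; the exact index assigned to each distinct number depends on the row order distinct() happens to produce):

dummy_data = [('a', 1), ('b', 25), ('c', 3), ('d', 8), ('e', 1)]
df = spark.createDataFrame(dummy_data, ['letter', 'number'])

# index the distinct numbers, then join the index back onto the letters
numbers_indexed = dfZipWithIndex(df.select('number').distinct(),
                                 offset=0, colName='new')
result = (df.join(numbers_indexed, on='number')
            .select('letter', 'new'))
result.show()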