Splitting a Spark dataframe by row index

COS*_*STA 1 python apache-spark pyspark

I want to split a dataframe by row order. If there are 100 rows, splitting it into 4 equal dataframes should yield rows with indexes 0-24, 25-49, 50-74, and 75-99, respectively.

The only predefined function available is randomSplit, but randomSplit shuffles the data before splitting. Another approach I thought of is to find the row count with a count action and then extract the rows with take, but that is very expensive. Is there any other way to achieve this while preserving the order?
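For reference, a minimal sketch of the count-and-take approach described above, assuming a 4-way split of a DataFrame df; it is expensive because take(n) collects every row to the driver:

n = df.count()                                   # full pass over the data
rows = df.take(n)                                # pulls all rows to the driver
chunk = n // 4
# the resulting parts are local lists of Rows, no longer DataFrames
parts = [rows[i * chunk:(i + 1) * chunk] for i in range(4)]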

ktd*_*drv 6

You can use monotonically_increasing_id to get a row number (if you don't already have one), then use ntile over a window ordered by that row number to split the data into as many partitions as you want:

from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile

values = [(str(i),) for i in range(100)]
df = spark.createDataFrame(values, ('value',))

def split_by_row_index(df, num_partitions=4):
    # Let's assume you don't have a row_id column that has the row order
    t = df.withColumn('_row_id', monotonically_increasing_id())
    # Using ntile() because monotonically_increasing_id is discontinuous across partitions
    t = t.withColumn('_partition', ntile(num_partitions).over(Window.orderBy(t._row_id))) 
    return [t.filter(t._partition == i + 1).drop('_row_id', '_partition') for i in range(num_partitions)]

[i.collect() for i in split_by_row_index(df)]
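One caveat: Window.orderBy without a partitionBy moves all rows into a single partition, and Spark will warn about the resulting performance degradation on large data. A hedged alternative sketch that avoids the single-partition window, assuming the DataFrame's current order is the one you want to keep: index rows with the RDD's zipWithIndex and filter on computed boundaries (split_by_zip_with_index is a hypothetical helper name, not part of any API):

from pyspark.sql import Row

def split_by_zip_with_index(df, num_partitions=4):
    # zipWithIndex assigns contiguous indices in the DataFrame's current order
    n = df.count()
    bounds = [n * i // num_partitions for i in range(num_partitions + 1)]
    indexed = df.rdd.zipWithIndex() \
        .map(lambda pair: Row(_row_id=pair[1], **pair[0].asDict())) \
        .toDF()
    return [indexed.filter((indexed._row_id >= bounds[i]) & (indexed._row_id < bounds[i + 1]))
                   .drop('_row_id')
            for i in range(num_partitions)]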