DataFrame-ified zipWithIndex

Dav*_*fin 34 apache-spark apache-spark-sql

我试图解决向数据集添加序列号的古老问题.我正在使用DataFrames,似乎没有相应的DataFrame RDD.zipWithIndex.另一方面,以下工作或多或少按我希望的方式工作:

val origDF = sqlContext.load(...)    

val seqDF= sqlContext.createDataFrame(
    origDF.rdd.zipWithIndex.map(ln => Row.fromSeq(Seq(ln._2) ++ ln._1.toSeq)),
    StructType(Array(StructField("seq", LongType, false)) ++ origDF.schema.fields)
)
Run Code Online (Sandbox Code Playgroud)

在我的实际应用程序中,origDF不会直接从文件中加载 - 它将通过将2-3个其他DataFrame连接在一起而创建,并将包含超过1亿行.

有一个更好的方法吗?我该怎么做才能优化它?

Kir*_*rst 34

以下内容是代表David Griffin发布的(无法编辑).

全唱,全舞dfZipWithIndex方法.您可以设置起始偏移量(默认为1),索引列名称(默认为"id"),并将列放在前面或后面:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.Row


def dfZipWithIndex(
  df: DataFrame,
  offset: Int = 1,
  colName: String = "id",
  inFront: Boolean = true
) : DataFrame = {
  df.sqlContext.createDataFrame(
    df.rdd.zipWithIndex.map(ln =>
      Row.fromSeq(
        (if (inFront) Seq(ln._2 + offset) else Seq())
          ++ ln._1.toSeq ++
        (if (inFront) Seq() else Seq(ln._2 + offset))
      )
    ),
    StructType(
      (if (inFront) Array(StructField(colName,LongType,false)) else Array[StructField]()) 
        ++ df.schema.fields ++ 
      (if (inFront) Array[StructField]() else Array(StructField(colName,LongType,false)))
    )
  ) 
}
Run Code Online (Sandbox Code Playgroud)


Evg*_*tov 11

从Spark 1.6开始,有一个名为monotonically_increasing_id()的函数.
它为每一行生成一个具有唯一64位单调索引的新列.
但它不是重要的,每个分区都会启动一个新范围,因此我们必须在使用它之前计算每个分区偏移量.
试图提供一个"rdd-free"解决方案,我最终得到了一些collect(),但它只收集偏移量,每个分区一个值,所以它不会导致OOM

def zipWithIndex(df: DataFrame, offset: Long = 1, indexName: String = "index") = {
    val dfWithPartitionId = df.withColumn("partition_id", spark_partition_id()).withColumn("inc_id", monotonically_increasing_id())

    val partitionOffsets = dfWithPartitionId
        .groupBy("partition_id")
        .agg(count(lit(1)) as "cnt", first("inc_id") as "inc_id")
        .orderBy("partition_id")
        .select(sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") - col("inc_id") + lit(offset) as "cnt" )
        .collect()
        .map(_.getLong(0))
        .toArray

     dfWithPartitionId
        .withColumn("partition_offset", udf((partitionId: Int) => partitionOffsets(partitionId), LongType)(col("partition_id")))
        .withColumn(indexName, col("partition_offset") + col("inc_id"))
        .drop("partition_id", "partition_offset", "inc_id")
}
Run Code Online (Sandbox Code Playgroud)

这个解决方案没有重新打包原始行,也没有重新分配原始的巨大数据帧,因此它在现实世界中非常快:200GB的CSV数据(4300万行,150列)在2分钟内读取,索引并打包到镶木地板在240核心上
测试我的解决方案后,我运行了Kirk Broadhurst的解决方案,它慢了20秒
你可能想要或不想使用dfWithPartitionId.cache(),取决于任务


Dav*_*fin 8

从Spark 1.5开始,Window表达式被添加到Spark.而不必在转换的DataFrame一个RDD,你现在可以使用org.apache.spark.sql.expressions.row_number.请注意,我发现上面的性能dfZipWithIndex明显快于下面的算法.但我发布它是因为:

  1. 其他人会试图尝试这个
  2. 也许有人可以优化下面的表达式

无论如何,这对我有用:

import org.apache.spark.sql.expressions._

df.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1))))
Run Code Online (Sandbox Code Playgroud)

请注意,我lit(1)用于分区和排序 - 这使得所有内容都在同一个分区中,并且似乎保留了原始顺序DataFrame,但我认为这会减慢它的速度.

我在一个DataFrame有7,000,000行的4列上进行了测试,速度差异在这个和上面之间是显着的dfZipWithIndex(就像我说的,RDD功能更快,更快).

  • 如果数据集不适合单个工作人员的内存,这不会导致OOM错误吗? (3认同)

Tag*_*gar 5

PySpark 版本:

from pyspark.sql.types import LongType, StructField, StructType

def dfZipWithIndex (df, offset=1, colName="rowId"):
    '''
        Enumerates dataframe rows is native order, like rdd.ZipWithIndex(), but on a dataframe 
        and preserves a schema

        :param df: source dataframe
        :param offset: adjustment to zipWithIndex()'s index
        :param colName: name of the index column
    '''

    new_schema = StructType(
                    [StructField(colName,LongType(),True)]        # new added field in front
                    + df.schema.fields                            # previous schema
                )

    zipped_rdd = df.rdd.zipWithIndex()

    new_rdd = zipped_rdd.map(lambda (row,rowId): ([rowId +offset] + list(row)))

    return spark.createDataFrame(new_rdd, new_schema)
Run Code Online (Sandbox Code Playgroud)

还创建了一个 jira 以在 Spark 本地添加此功能:https : //issues.apache.org/jira/browse/SPARK-23074


小智 5

@Evgeny,你的解决方案很有趣。请注意,当您有空分区时会出现错误(数组缺少这些分区索引,至少在我的 Spark 1.6 中发生了这种情况),因此我将数组转换为 Map(partitionId -> offsets)。

另外,我取出了 monotonically_increasing_id 的来源,让每个分区中的“inc_id”从 0 开始。

这是更新版本:

import org.apache.spark.sql.catalyst.expressions.LeafExpression
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.catalyst.expressions.Nondeterministic
import org.apache.spark.sql.catalyst.expressions.codegen.GeneratedExpressionCode
import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window

case class PartitionMonotonicallyIncreasingID() extends LeafExpression with Nondeterministic {

  /**
   * From org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
   *
   * Record ID within each partition. By being transient, count's value is reset to 0 every time
   * we serialize and deserialize and initialize it.
   */
  @transient private[this] var count: Long = _

  override protected def initInternal(): Unit = {
    count = 1L // notice this starts at 1, not 0 as in org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
  }

  override def nullable: Boolean = false

  override def dataType: DataType = LongType

  override protected def evalInternal(input: InternalRow): Long = {
    val currentCount = count
    count += 1
    currentCount
  }

  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
    val countTerm = ctx.freshName("count")
    ctx.addMutableState(ctx.JAVA_LONG, countTerm, s"$countTerm = 1L;")
    ev.isNull = "false"
    s"""
      final ${ctx.javaType(dataType)} ${ev.value} = $countTerm;
      $countTerm++;
    """
  }
}

object DataframeUtils {
  def zipWithIndex(df: DataFrame, offset: Long = 0, indexName: String = "index") = {
    // from /sf/ask/2121336731/)
    val dfWithPartitionId = df.withColumn("partition_id", spark_partition_id()).withColumn("inc_id", new Column(PartitionMonotonicallyIncreasingID()))

    // collect each partition size, create the offset pages
    val partitionOffsets: Map[Int, Long] = dfWithPartitionId
      .groupBy("partition_id")
      .agg(max("inc_id") as "cnt") // in each partition, count(inc_id) is equal to max(inc_id) (I don't know which one would be faster)
      .select(col("partition_id"), sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") + lit(offset) as "cnt")
      .collect()
      .map(r => (r.getInt(0) -> r.getLong(1)))
      .toMap

    def partition_offset(partitionId: Int): Long = partitionOffsets(partitionId)
    val partition_offset_udf = udf(partition_offset _)
    // and re-number the index
    dfWithPartitionId
      .withColumn("partition_offset", partition_offset_udf(col("partition_id")))
      .withColumn(indexName, col("partition_offset") + col("inc_id"))
      .drop("partition_id")
      .drop("partition_offset")
      .drop("inc_id")
  }
}
Run Code Online (Sandbox Code Playgroud)