获取两个版本的 Delta Lake 表之间的差异

Question

获取两个版本的 Delta Lake 表之间的差异

Ism*_*l H 5 scala apache-spark delta-lake

如何找到增量表的两个最新版本之间的差异？这是我使用数据框的情况：

val df1 = spark.read
  .format("delta")
  .option("versionAsOf", "0001")
  .load("/path/to/my/table")

val df2 = spark.read
  .format("delta")
  .option("versionAsOf", "0002")
  .load("/path/to/my/table")

// non idiomatic way to do it ...
df1.unionAll(df2).except(df1.intersect(df2))

Run Code Online (Sandbox Code Playgroud)

Databricks有一个 Delta 的商业版本，它提供了一个名为 CDF 的解决方案，但我正在寻找一个开源替代方案

Answer 1

Ism*_*l H 1

使用@Zinking的评论，我设法获得了一个数据框，其中计算了两个版本之间的差异：

1）获取最新版本：

val lastVersion = DeltaTable.forPath(spark, PATH_TO_DELTA_TABLE)
    .history()
    .select(col("version"))
    .collect.toList
    .headOption
    .getOrElse(throw new Exception("Is this table empty ?"))

Run Code Online (Sandbox Code Playgroud)

2) 获取特定版本中标记为“添加”或“删除”的 parquet 文件列表0000NUMVERSION：

val addPathList = spark .read .json(s"ROOT_PATH/_delta_log/0000NUMVERSION.json") .where(s"add is not null") .select(s"add.path") .collect() .map(path => formatPath(path.toString)) .toList val removePathList = spark .read .json(s"ROOT_PATH/_delta_log/0000NUMVERSION.json") .where(s"remove is not null") .select(s"remove.path") .collect() .map(path => formatPath(path.toString)) .toList
Run Code Online (Sandbox Code Playgroud)
3）将它们加载到数据框中

import org.apache.spark.sql.functions._ val addDF = spark .read .format("parquet") .load(addPathList: _*) .withColumn("add_remove", lit("add")) val removeDF = spark .read .format("parquet") .load(removePathList: _*) .withColumn("add_remove", lit("remove"))
Run Code Online (Sandbox Code Playgroud)
4）两个数据帧的并集代表“差异”：

addDF.union(removeDF).show() +----------+----------+ |updatedate|add_remove| +----------+----------+ | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| | null| add| +----------+----------+ only showing top 20 rows
Run Code Online (Sandbox Code Playgroud)

归档时间：	3 年，11 月前
查看次数：	2365 次
最近记录：	3 年，10 月前