How to delete a particular month from a parquet file partitioned by month

Question

cph*_*sto 6 python apache-spark parquet pyspark

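I have monthly Revenue data for the last 5 years, and I am storing the DataFrames for the respective months in parquet format in append mode, partitioned by the month column. Here is the pseudo-code: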

def Revenue(filename):
    df = spark.read.load(filename)
    # ... (transformations elided in the original)
    df.write.format('parquet').mode('append').partitionBy('month').save('/path/Revenue')

Revenue('Revenue_201501.csv')
Revenue('Revenue_201502.csv')
Revenue('Revenue_201503.csv')
Revenue('Revenue_201504.csv')
Revenue('Revenue_201505.csv')

The df gets stored in parquet format on a monthly basis, with one folder per month.
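
For reference, partitionBy('month') writes one subdirectory per distinct value, named month=<value>. Here is a minimal sketch for listing which months are present, using only the standard library. It assumes the dataset sits on a local filesystem and that the values look like the '2015-02-01' used later in this question; list_partitions is a hypothetical helper name, not part of Spark.

```python
from pathlib import Path


def list_partitions(root, column="month"):
    """Return the sorted partition values found as '<column>=<value>' subdirectories of root."""
    prefix = column + "="
    return sorted(
        p.name[len(prefix):]
        for p in Path(root).iterdir()
        # Skip plain files such as _SUCCESS and any unrelated directories.
        if p.is_dir() and p.name.startswith(prefix)
    )
```

Calling list_partitions('/path/Revenue') would then give the month values that currently have a folder.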

Question: How can I delete the parquet folder corresponding to a particular month?

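One way would be to load all these parquet files into one big df, use a .where() clause to filter out that particular month, and then save it back in parquet format, partitionBy month, in overwrite mode, like this: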

from pyspark.sql.functions import col, lit

# If we want to remove data from Feb, 2015
df = spark.read.format('parquet').load('Revenue.parquet')
df = df.where(col('month') != lit('2015-02-01'))
df.write.format('parquet').mode('overwrite').partitionBy('month').save('/path/Revenue')

However, this approach is quite cumbersome.

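The other way is to directly delete the folder of that particular month, but I am not sure whether that is the right way to approach this, lest we alter the metadata in an unforeseeable way.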
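
To make the folder-deletion idea concrete, here is a minimal sketch, not a confirmed recommendation. It assumes a local filesystem (HDFS or an object store would need its own client), the month=<value> directory naming that partitionBy produces, and that no Spark job is reading or writing the dataset at the same time. drop_month is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def drop_month(root, month, column="month"):
    """Delete the '<column>=<month>' partition directory under root.

    Returns True if a directory was removed, False if it did not exist.
    """
    part_dir = Path(root) / f"{column}={month}"
    if not part_dir.is_dir():
        return False
    # Removes the directory and every part file inside it.
    shutil.rmtree(part_dir)
    return True
```

For example, drop_month('/path/Revenue', '2015-02-01') would remove only the February 2015 folder. If the dataset is also registered as a table in a catalog, the catalog's partition list would presumably need updating separately.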

What would be the right way to delete the parquet data for a particular month?

Answer 1

DaR*_*MaN 2

Spark supports dropping a partition, both data and metadata.
Quoting the Scala code comment:

/**
 * Drop Partition in ALTER TABLE: to drop a particular partition for a table.
 *
 * This removes the data and metadata for this partition.
 * The data is actually moved to the .Trash/Current directory if Trash is configured,
 * unless 'purge' is true, but the metadata is completely lost.
 * An error message will be issued if the partition does not exist, unless 'ifExists' is true.
 * Note: purge is always false when the target is a view.
 *
 * The syntax of this command is:
 * {{{
 *   ALTER TABLE table DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...] [PURGE];
 * }}}
 */

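In your case, there is no backing table. We can register the DataFrame as a temp table and use the above syntax (see the temp table documentation).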

In pyspark, we can run the SQL using the syntax from the linked example:

df = spark.read.format('parquet').load('Revenue.parquet')
df.createOrReplaceTempView("tmp")  # registerTempTable returns None and is deprecated since Spark 2.0
spark.sql("ALTER TABLE tmp DROP IF EXISTS PARTITION (month='2015-02-01') PURGE")
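
If several months need dropping, the statement can be generated rather than typed by hand. Below is a small sketch that only builds the string, following the syntax quoted in the Scala comment. drop_partition_sql is a hypothetical helper, and it assumes the target is a partitioned table that Spark accepts ALTER TABLE on (a catalog table may be required rather than a temp view).

```python
def drop_partition_sql(table, spec, if_exists=True, purge=False):
    """Build an ALTER TABLE ... DROP PARTITION statement for one partition spec."""
    # No escaping is done here: table and spec are assumed to come from trusted code.
    cols = ", ".join(f"{col}='{val}'" for col, val in spec.items())
    stmt = f"ALTER TABLE {table} DROP "
    if if_exists:
        stmt += "IF EXISTS "
    stmt += f"PARTITION ({cols})"
    if purge:
        stmt += " PURGE"
    return stmt
```

The result can then be passed to spark.sql(...), for example spark.sql(drop_partition_sql("tmp", {"month": "2015-02-01"}, purge=True)).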
