Using partitionBy on a DataFrameWriter writes the directory layout with column names, not just values

asked by sat*_*kum (15 votes) · tags: configuration, scala, apache-spark, spark-dataframe

I am using Spark 2.0.

I have a DataFrame. My code looks like this:

df.write.partitionBy("year", "month", "day").format("csv").option("header", "true").save(s"s3://bucket/")

When the program runs, it writes files in the following layout:

s3://bucket/year=2016/month=11/day=15/file.csv

How can I configure it so the layout is:

s3://bucket/2016/11/15/file.csv

I would also like to know whether the file name can be configured.

The relevant documentation here seems sparse...
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter

partitionBy(colNames: String*): DataFrameWriter[T]
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:

year=2016/month=01/
year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.

This was initially applicable for Parquet but in 1.5+ covers JSON, text, ORC and avro as well.

answered by 小智 (10 votes)

This is expected and desired behavior. Spark uses the directory structure for partition discovery and partition pruning, and the correct structure, including the column names, is required for that to work.

You also have to remember that partitioning removes the partition columns from the data files themselves; their values can only be recovered from the directory names.
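
For example, reading the root of the partitioned layout back restores year, month and day as columns parsed from the directory names, and a predicate on those columns is resolved against the paths. A minimal sketch, assuming the s3://bucket/ layout from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-discovery").getOrCreate()
import spark.implicits._

// Partition discovery: year, month and day come back as columns,
// with their values parsed from the year=.../month=.../day=... directories.
val df = spark.read.option("header", "true").csv("s3://bucket/")
df.printSchema() // includes year, month and day alongside the file's own columns

// Partition pruning: this predicate is answered from the paths alone, so only
// the directories under year=2016/month=11/ are listed and read.
df.where($"year" === 2016 && $"month" === 11).show()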

If you need a specific directory structure, you should rename the directories with a downstream process.
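
One way to do that is with the Hadoop FileSystem API after the Spark job finishes: walk the output tree and strip the "column=" prefix from each partition directory. A minimal sketch with a hypothetical stripPartitionNames helper, assuming the s3://bucket/ output from the question (note that a rename on S3 is implemented as copy-plus-delete, so this step is not free):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: renames year=2016 -> 2016, month=11 -> 11, day=15 -> 15,
// recursing into each level of the partition tree. Leaf files are left alone.
def stripPartitionNames(fs: FileSystem, dir: Path): Unit = {
  fs.listStatus(dir).filter(_.isDirectory).foreach { status =>
    val name = status.getPath.getName
    val eq = name.indexOf('=')
    val target =
      if (eq >= 0) new Path(dir, name.substring(eq + 1)) else status.getPath
    if (target != status.getPath) fs.rename(status.getPath, target)
    stripPartitionNames(fs, target)
  }
}

val root = new Path("s3://bucket/")
val fs = FileSystem.get(new URI(root.toString), new Configuration())
stripPartitionNames(fs, root)

Be aware that once the column=value prefixes are gone, Spark can no longer discover year, month and day as partition columns when the data is read back.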