How can I view the min/max statistics in Parquet metadata?

Jos*_*osh 4 apache-spark parquet databricks

I am trying to make use of Parquet's min/max statistics. I am following the question/answer here: Spark Parquet Statistics(min/max) integration

scala> val foo = spark.sql("select id, cast(id as string) text from range(1000)").sort("id") 

scala> foo.printSchema

root
 |-- id: long (nullable = false)
 |-- text: string (nullable = false)
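The question does not show the write step. Presumably something like the following was used to produce the .gz.parquet file inspected below; the output path and the gzip compression option here are assumptions, not taken from the question:

// Assumed write step (not shown in the question): persist the sorted DataFrame
// as gzip-compressed Parquet. The path /tmp/foo_parquet is hypothetical.
foo.write.option("compression", "gzip").parquet("/tmp/foo_parquet")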

However, when I look at an individual Parquet file, I don't see any min/max statistics:

> parquet-tools meta part-00000-tid-5174196010762120422-95fb2e22-0dfb-4597-bdca-4fb573873959-0-c000.gz.parquet
file:        file:.../part-00000-tid-5174196010762120422-95fb2e22-0dfb-4597-bdca-4fb573873959-0-c000.gz.parquet
creator:     parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"text","type":"string","nullable":false,"metadata":{}}]}

file schema: spark_schema
--------------------------------------------------------------------------------
id:          REQUIRED INT64 R:0 D:0
text:        REQUIRED BINARY O:UTF8 R:0 D:0

row group 1: RC:125 TS:1840 OFFSET:4
--------------------------------------------------------------------------------
id:           INT64 GZIP DO:0 FPO:4 SZ:259/1044/4.03 VC:125 ENC:PLAIN,BIT_PACKED
text:         BINARY GZIP DO:0 FPO:263 SZ:263/796/3.03 VC:125 ENC:PLAIN,BIT_PACKED

I also tried .sortWithinPartitions("id"), with the same result.

Ste*_*fey 5

You can view the statistics with parquet-tools. In your case you would run:

parquet-tools dump -d -n part-00000-tid-5174196010762120422-95fb2e22-0dfb-4597-bdca-4fb573873959-0-c000.gz.parquet

As of today (June 9, 2017), Spark 2.1.1 with Parquet 1.8.1 does not generate statistics for binary columns such as strings.
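If you prefer to inspect the footer programmatically rather than with parquet-tools, a minimal sketch using the parquet-hadoop API could look like the following. It assumes Parquet 1.8.x, where ParquetFileReader.readFooter(Configuration, Path) is available, and the file path is a placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Read the footer of a single Parquet file and print the per-column
// statistics stored for each row group. The path below is hypothetical.
val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/tmp/foo_parquet/part-00000.gz.parquet"))

footer.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach { col =>
    val stats = col.getStatistics
    // For string (BINARY) columns written by Spark 2.1.1 / Parquet 1.8.1,
    // these statistics will be empty, as noted above.
    println(s"${col.getPath}: min=${stats.genericGetMin}, max=${stats.genericGetMax}")
  }
}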