On the Spark master branch, I tried to write a single column containing "a", "b", "c" to a Parquet file f1:
scala> List("a", "b", "c").toDF("field1").coalesce(1).write.parquet("f1")
But the saved file has no statistics (min, max):
$ ls f1/*.parquet
f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
$ parquet-tool meta f1/*.parquet
file: file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
creator: parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"field1","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
field1: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:3 TS:48 OFFSET:4
--------------------------------------------------------------------------------
field1: BINARY SNAPPY DO:0 FPO:4 SZ:50/48/0.96 VC:3 ENC:BIT_PACKED,RLE,PLAIN ST:[no stats for this column]
Any pointers would be greatly appreciated. Thank you.
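For context on why the missing `ST:` statistics matter: a reader can skip a whole row group only when the footer records a min and max for the filtered column and the filter value falls outside that range. The following is a toy model of that decision in plain Scala. The names `RowGroupStats` and `canSkip` are hypothetical and are not part of the parquet-mr or Spark APIs, and ordinary string ordering stands in for Parquet's own binary comparison.

```scala
// Toy model of row-group pruning with min/max statistics.
// RowGroupStats and canSkip are hypothetical names, not parquet-mr API.
final case class RowGroupStats(id: Int, min: Option[String], max: Option[String])

// For a predicate `field1 = value`, a row group can be skipped only when
// statistics exist and the value lies outside the [min, max] range.
def canSkip(stats: RowGroupStats, value: String): Boolean =
  (stats.min, stats.max) match {
    case (Some(lo), Some(hi)) => value < lo || value > hi
    case _                    => false // no stats: the row group must be scanned
  }

// A row group like the one written above, if it had carried statistics.
val withStats = RowGroupStats(1, Some("a"), Some("c"))
// The row group as actually reported: "no stats for this column".
val noStats = RowGroupStats(2, None, None)

println(canSkip(withStats, "z")) // true: "z" is above max "c"
println(canSkip(withStats, "b")) // false: "b" is inside the range
println(canSkip(noStats, "z"))   // false: nothing to prune with
```

Under this model, a file whose columns report no statistics can never be pruned, whatever the filter.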
I am trying to use Parquet's min/max index. I am following the question/answer here: Spark Parquet Statistics (min/max) integration
scala> val foo = spark.sql("select id, cast(id as string) text from range(1000)").sort("id")
scala> foo.printSchema
root
|-- id: long (nullable = false)
|-- text: string (nullable = false)
When I look at an individual Parquet file, I don't see any min/max:
> parquet-tools meta part-00000-tid-5174196010762120422-95fb2e22-0dfb-4597-bdca-4fb573873959-0-c000.gz.parquet
file: file:.../part-00000-tid-5174196010762120422-95fb2e22-0dfb-4597-bdca-4fb573873959-0-c000.gz.parquet
creator: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"text","type":"string","nullable":false,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
id: REQUIRED INT64 R:0 D:0
text: REQUIRED BINARY O:UTF8 R:0 D:0
row group 1: RC:125 TS:1840 OFFSET:4
--------------------------------------------------------------------------------
id: INT64 GZIP DO:0 FPO:4 SZ:259/1044/4.03 VC:125 ENC:PLAIN,BIT_PACKED
text: …