有人可以解释一下 orcfiledump 的输出吗？

Question

有人可以解释一下 orcfiledump 的输出吗？

我的表test_orc包含（对于一个分区）：

col1 col2 part1
abc  def  1
ghi  jkl  1
mno  pqr  1
koi  hai  1
jo   pgl  1
hai  tre  1

Run Code Online (Sandbox Code Playgroud)

通过跑步

hive --orcfiledump /hive/user.db/test_orc/part1=1/000000_0

Run Code Online (Sandbox Code Playgroud)

我得到以下信息：

Structure for /hive/a0m01lf.db/test_orc/part1=1/000000_0 .  
2018-02-18 22:10:24 INFO: org.apache.hadoop.hive.ql.io.orc.ReaderImpl -  Reading ORC rows from /hive/a0m01lf.db/test_orc/part1=1/000000_0 with {include: null, offset: 0, length: 9223372036854775807} .  
Rows: 6 .  
Compression: ZLIB .  
Compression size: 262144 .  
Type: struct<_col0:string,_col1:string> .  

Stripe Statistics:   
  Stripe 1:   
    Column 0: count: 6 .  
    Column 1: count: 6 min: abc max: mno sum: 17 .  
    Column 2: count: 6 min: def max: tre sum: 18 .  

File Statistics:   
  Column 0: count: 6 .  
  Column 1: count: 6 min: abc max: mno sum: 17 .  
  Column 2: count: 6 min: def max: tre sum: 18 .  

Stripes:   
  Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67 .  
    Stream: column 0 section ROW_INDEX start: 3 length 9 .  
    Stream: column 1 section ROW_INDEX start: 12 length 29 .  
    Stream: column 2 section ROW_INDEX start: 41 length 29 .  
    Stream: column 1 section DATA start: 70 length 20 .  
    Stream: column 1 section LENGTH start: 90 length 12 .  
    Stream: column 2 section DATA start: 102 length 21 .  
    Stream: column 2 section LENGTH start: 123 length 5 .  
    Encoding column 0: DIRECT .  
    Encoding column 1: DIRECT_V2 .  
    Encoding column 2: DIRECT_V2 .

Run Code Online (Sandbox Code Playgroud)

关于条纹的部分是什么意思？

Answer 1

Rah*_*hul 7

首先，让我们看看 ORC 文件是什么样子的。

\n\n

现在上图中以及您的问题中使用了一些关键字！

\n\n

Stripe - 存储在 ORC 文件中的一块数据。任何 ORC 文件都分为这些块，称为条带，每个块大小为 250 MB，包含索引数据、实际数据和存储在该条带中的实际数据的一些元数据。
压缩- 用于压缩存储数据的压缩编解码器。ZLIB 是 ORC 的默认值。
索引数据- 包括每列的最小值和最大值以及每列中的行位置。（也可以包括位字段或布隆过滤器。）行索引条目提供偏移量，使得能够在解压缩块内查找正确的压缩块和字节。 请注意，ORC 索引仅用于选择条带和行组，而不用于回答查询。
行数据- 实际数据。用于表扫描。
条带页脚- 条带页脚包含每列的编码和流的目录（包括它们的位置）。为了描述每个流，ORC 存储流的类型、列 ID 和流\xe2\x80\x99s 大小（以字节为单位）。每个流中存储内容的详细信息取决于列的类型和编码。
Postscript - 保存压缩参数和压缩页脚的大小。
文件页脚- 文件页脚包含文件中的条带列表、每个条带的行数以及每列的数据类型。它还包含列级聚合计数、最小值、最大值和总和。

\n\n

现在！谈论 orcfiledump 的输出。

\n\n

首先是有关您的文件的一般信息。名称、位置、压缩编解码器、压缩大小等。
Stripe统计会列出你的ORC文件中的所有stripe及其对应的信息。您可以查看有关整数列的计数和一些统计信息，例如最小值、最大值、总和等。
文件统计信息与#2 类似。仅针对完整文件，而不是#2 中的每个条带。
最后一部分，即 Stripe 部分，讨论文件中的每一列以及每一列相应的索引信息。

\n\n

此外，您可以使用 orcfiledump 的各种选项来获得“所需”结果。遵循方便的指南。

\n\n

// Hive version 0.11 through 0.14:\nhive --orcfiledump <location-of-orc-file>\n\n// Hive version 1.1.0 and later:\nhive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>\n\n// Hive version 1.2.0 and later:\nhive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>\n\n// Hive version 1.3.0 and later:\nhive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump] \n    [--backup-path <new-path>] <location-of-orc-file-or-directory>\n

Run Code Online (Sandbox Code Playgroud)\n\n

遵循上述命令中使用的选项的快速指南。

\n\n

在命令中指定 -d 将导致转储 ORC 文件数据\n而不是元数据（Hive 1.1.0 及更高版本）。
使用逗号分隔的列 ID 列表指定 --rowindex 将导致打印指定列的行索引，其中 0 是包含所有列的顶级结构，1 是第一个列 ID (Hive 1.1.1)。 0 及更高版本）。
在命令中指定 -t 将打印\n编写器的时区 ID。
在命令中指定 -j 将以 JSON\n格式打印 ORC 文件元数据。要漂亮地打印 JSON 元数据，请将 -p 添加到命令中。
在命令中指定 --recover 将恢复 Hive 流生成的损坏的 ORC 文件\n。
指定 --skip-dump 和 --recover 将执行恢复\n而不转储元数据。
使用新路径指定 --backup-path 将使恢复工具\n将损坏的文件移动到指定的备份路径（默认值：/tmp）。
是 ORC 文件的 URI。
是 ORC 文件或目录的 URI。\n 从 Hive 1.3.0 开始，此 URI 可以是包含 ORC 文件的目录。

\n\n

希望有帮助！

\n

归档时间：	7 年，8 月前
查看次数：	6437 次
最近记录：	7 年，8 月前