How can I tune Hive to answer a query from metadata?

Question

KBR*_*KBR 5 performance hadoop hive hdfs tez

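If I run the Hive query below on a table that has partition columns, I want to make sure Hive does not perform a full table scan and instead works out the result from the metadata alone. Is there a way to enable this?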

SELECT max(partitioned_col) FROM hive_table;

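Right now, when I run this query it launches MapReduce tasks, and I am fairly sure it is scanning the data, even though it could get the value from the metadata alone.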

Answer 1

lef*_*oin 5

Compute table statistics every time the data changes.

ANALYZE TABLE hive_table PARTITION(partitioned_col) COMPUTE STATISTICS FOR COLUMNS;

Enable CBO and automatic statistics gathering:

set hive.cbo.enable=true;
set hive.stats.autogather=true;

Use these settings so that CBO makes use of the statistics:

set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.fetch.column.stats=true;

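If none of this helps, I would recommend this approach for finding the last partition quickly: parse the maximum partition key from the table location with a shell script. The command below lists all paths under the table folder, sorts them in reverse order, takes the first one, cuts out the partition subfolder name, and extracts the value from it. All you need to do is initialize the TABLE_DIR variable and fill in the number of the partition subfolder in the path: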

last_partition=$(hadoop fs -ls "$TABLE_DIR"/* | awk '{ print $8 }' | sort -r | head -n1 | cut -d / -f [number of partition subfolder in the path here] | cut -d = -f 2)
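
The pipeline above depends on knowing which path field holds the partition folder. As a rough sketch of the same idea, the variant below matches on the partition key name instead, so no field number is needed. The function name `max_partition_value` and the sample paths are made up for illustration and are not part of the original answer. It assumes the key is a plain identifier and that the values sort correctly as text (for example zero-padded dates).

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): reads paths on stdin
# and prints the largest value found for the given partition key.
max_partition_value() {
  key="$1"
  # Keep only the "/key=value" path segment, drop everything up to "=",
  # then sort in reverse text order and take the first line.
  grep -o "/${key}=[^/]*" | cut -d = -f 2- | sort -r | head -n 1
}

# Demo on made-up paths; real input would come from `hadoop fs -ls`.
printf '%s\n' \
  '/warehouse/hive_table/partitioned_col=2020-01-01/000000_0' \
  '/warehouse/hive_table/partitioned_col=2020-01-03/000000_0' \
  '/warehouse/hive_table/partitioned_col=2020-01-02/000000_0' \
  | max_partition_value partitioned_col
# prints: 2020-01-03
```

Against a real table this could be fed with something like hadoop fs -ls -R "$TABLE_DIR" | awk '{ print $8 }' | max_partition_value partitioned_col. The sort is lexicographic, so numeric partition values that are not zero-padded would need sort -rn instead.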

Then pass the $last_partition variable to your script:

  hive -hiveconf last_partition="$last_partition" -f your_script.hql
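
Inside the script, a value passed with -hiveconf can be referenced through Hive's variable substitution syntax, ${hiveconf:name}. The snippet below is only an illustration of what your_script.hql might contain. The query itself is hypothetical, and quoting the value assumes a string-typed partition column.

```shell
#!/bin/sh
# Write an illustrative your_script.hql. The quoted 'EOF' stops the shell
# from trying to expand ${hiveconf:...} itself; Hive substitutes it later.
cat > your_script.hql <<'EOF'
-- Hypothetical query: read only the partition passed in via -hiveconf.
SELECT count(*)
FROM hive_table
WHERE partitioned_col = '${hiveconf:last_partition}';
EOF
```

Because the filter is on the partition column, Hive can prune to that single partition rather than reading the whole table.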

Asked:	9 years, 4 months ago
Viewed:	1,696 times
Last activity:	6 years, 11 months ago