I'm working on improving the performance of a table.
Say this is the table:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
STORED AS PARQUET;
I'm planning to apply bucketing on user_id, since queries usually filter on user_id in a clause.
Like this:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
CLUSTERED BY(user_id) INTO 256 BUCKETS
STORED AS PARQUET;
The table will be created and loaded with Hive, and queried from Impala.
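For concreteness, here is a rough sketch of the load and of the kind of query involved. It is only an illustration under assumptions: the source table user_info is assumed to have columns user_id, firstname, lastname, year and month (its schema is not shown above), and the literal values in the query are made up.

```sql
-- Assumption: user_info has columns user_id, firstname, lastname, year, month.

-- Run in Hive. On Hive 0.x/1.x this setting makes inserts honor CLUSTERED BY;
-- Hive 2.x enforces bucketing without it.
SET hive.enforce.bucketing = true;

-- Let the partition values come from the SELECT (dynamic partitioning).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE user_info_bucketed PARTITION (Year, month)
SELECT user_id, firstname, lastname, year, month
FROM user_info;

-- Run in Impala: a typical lookup by user_id (values are hypothetical).
SELECT firstname, lastname
FROM user_info_bucketed
WHERE year = 2016 AND month = 5 AND user_id = 12345;
```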
What I'd like to know is whether bucketing this table will improve the performance of Impala queries. I'm not sure how Impala works with buckets.