How are CTEs (common table expressions) evaluated in Hive?

pav*_*ddy 5 hive common-table-expression

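My question is about performance and how CTEs are evaluated at run time.

I plan to reuse code by defining a base projection and then defining multiple CTEs on top of it, each with a different filter.

Will this cause any performance problems? More specifically, is the base projection evaluated every time it is referenced?

For example: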

WITH CTE_PERSON as (
   SELECT * FROM PersonTable
),

CTE_PERSON_WITH_AGE as (
   SELECT * FROM CTE_PERSON WHERE age > 24
),

CTE_PERSON_WITH_AGE_AND_GENDER as (
   SELECT * FROM CTE_PERSON_WITH_AGE WHERE gender = 'm'
),

CTE_PERSON_WITH_NAME as (
   SELECT * FROM CTE_PERSON WHERE name = 'abc'
)
  • Are all rows from PersonTable loaded into memory each time, with the filters applied afterwards, or
  • is only the filtered result set loaded into memory?

Dav*_*itz 7

A single scan.

Note, in the plan below:
- A single stage
- A single TableScan
- predicate: (((i = 1) and (j = 2)) and (k = 3)) (type: boolean)


create table t (i int,j int,k int);
explain 
with    t1 as (select i,j,k from t  where i=1)
       ,t2 as (select i,j,k from t1 where j=2)
       ,t3 as (select i,j,k from t2 where k=3) 

select * from t3
;
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: t
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Filter Operator
            predicate: (((i = 1) and (j = 2)) and (k = 3)) (type: boolean)
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: 1 (type: int), 2 (type: int), 3 (type: int)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              ListSink
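The same check can be run against the query from the question. The plan above only covers a linear chain of CTEs, while the question also has CTE_PERSON referenced from two branches, so it is worth looking at that case separately. This is a minimal sketch, assuming a table named PersonTable with columns age, gender and name already exists; the final UNION ALL is an illustrative addition (the question does not include a final SELECT), and the resulting plan may differ between Hive versions and execution engines.

```sql
-- Assumes PersonTable(age, gender, name, ...) exists; not part of the original answer.
EXPLAIN
WITH CTE_PERSON AS (
   SELECT * FROM PersonTable
),
CTE_PERSON_WITH_AGE AS (
   SELECT * FROM CTE_PERSON WHERE age > 24
),
CTE_PERSON_WITH_AGE_AND_GENDER AS (
   SELECT * FROM CTE_PERSON_WITH_AGE WHERE gender = 'm'
),
CTE_PERSON_WITH_NAME AS (
   SELECT * FROM CTE_PERSON WHERE name = 'abc'
)
-- Illustrative final query that uses both branches built on CTE_PERSON.
SELECT * FROM CTE_PERSON_WITH_AGE_AND_GENDER
UNION ALL
SELECT * FROM CTE_PERSON_WITH_NAME
;
```

In the output, count the TableScan operators on PersonTable and check where each Filter Operator sits: a filter directly under the scan means the predicate is applied while the table is read, rather than after the whole table has been loaded.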