Composite index: most selective column first?

Eri*_*ric 20 oracle oracle-10g

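I've been reading about composite indexes and column ordering, and I'm a bit confused. This documentation (a little less than halfway down) says

In general, you should put the column expected to be used most often first in the index.

However, shortly afterwards it says

Create a composite index putting the most selective column first; that is, the column with the most values.

Oracle also says the same thing here, in other words

If all keys are used in WHERE clauses equally often, then ordering these keys from most selective to least selective in the CREATE INDEX statement best improves query performance.

However, I found an SO answer that says something different. It says

Arrange the columns with the least selective column first and the most selective column last. In the case of a tie, lead with the column that is more likely to be used on its own.

The first documentation I referenced says you should put the most frequently used column first, while the SO answer says that should only be a tie-breaker. They also differ on the ordering.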

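The documentation also talks about skip scanning and says

Skip scanning is advantageous if there are few distinct values in the leading column of the composite index and many distinct values in the nonleading key of the index.

Another article says

The prefix column should be the most discriminating and the most widely used in queries

which I take to mean that the most discriminating column is the most distinct one.

All of this research still leads me to the same question: should the most selective column be first or last? Should the first column be the most used one, with selectivity only as a tie-breaker?

These articles seem to contradict each other, but they do offer some examples. From what I've gathered, it seems more efficient to put the least selective column first if you are anticipating Index Skip Scans. But I'm not sure that's correct.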

ato*_*pas 10

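From AskTom:

(In 9i, there is a new "index skip scan"; search for it there to read about it. It makes the index (a,b) OR (b,a) useful in both of the above cases sometimes!)

So, the order of columns in your index depends on how your queries are written. You want to be able to use the index for as many queries as you can (so as to cut down on the overall number of indexes you have), and that will drive the order of the columns. Nothing else (the selectivity of a or b does not count at all).

One of the arguments for arranging the columns of a composite index from the least discriminating (fewer distinct values) to the most discriminating (more distinct values) is index key compression.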

SQL> create table t as select * from all_objects;

Table created.

SQL> create index t_idx_1 on t(owner,object_type,object_name);

Index created.

SQL> create index t_idx_2 on t(object_name,object_type,owner);

Index created.

SQL> select count(distinct owner), count(distinct object_type), count(distinct object_name ), count(*)  from t;

COUNT(DISTINCTOWNER) COUNT(DISTINCTOBJECT_TYPE) COUNT(DISTINCTOBJECT_NAME)      COUNT(*)
-------------------- -------------------------- --------------------------      ----------
                 30                         45                       52205      89807

SQL> analyze index t_idx_1 validate structure; 

Index analyzed.

SQL> select btree_space, pct_used, opt_cmpr_count, opt_cmpr_pctsave from index_stats;

BTREE_SPACE   PCT_USED OPT_CMPR_COUNT OPT_CMPR_PCTSAVE
----------- ---------- -------------- ----------------
    5085584     90          2           28

SQL> analyze index t_idx_2 validate structure; 

Index analyzed.

SQL> select btree_space, pct_used, opt_cmpr_count, opt_cmpr_pctsave  from index_stats; 

BTREE_SPACE   PCT_USED OPT_CMPR_COUNT OPT_CMPR_PCTSAVE
----------- ---------- -------------- ----------------
    5085584     90          1           14

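According to the index statistics, the first index is more compressible: OPT_CMPR_PCTSAVE is 28 for t_idx_1 versus 14 for t_idx_2.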

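The other argument is how the index is used in your queries. If your queries mostly use col1, for example, if you have queries like:

  • select * from t where col1 = :a and col2 = :b;
  • select * from t where col1 = :a;

    - then index(col1,col2) would perform better.

    If your queries mostly use col2,

  • select * from t where col1 = :a and col2 = :b;
  • select * from t where col2 = :b;

    - then index(col2,col1) would perform better. If all of your queries always specify both columns, then it doesn't matter which column comes first in the composite index.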
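To make the leading-column point concrete, here is a minimal sketch in Python. It is a toy model, not Oracle's B-tree: the "index" is just a sorted list of key tuples, the data and the names `idx_12`, `idx_21`, `seek_leading` and `scan_trailing` are made up for illustration, and skip scans are ignored on purpose.

```python
from bisect import bisect_left

# Hypothetical table: 100 x 100 distinct (col1, col2) pairs.
rows = [(c1, c2) for c1 in range(100) for c2 in range(100)]

# Toy "indexes": key tuples kept in sorted order.
idx_12 = sorted(rows)                          # like index(col1, col2)
idx_21 = sorted((c2, c1) for c1, c2 in rows)   # like index(col2, col1)

def seek_leading(index, value):
    """Entries read when the predicate is on the leading column.

    A binary search jumps to the first matching key, then we read
    forward until the leading column changes.
    """
    start = bisect_left(index, (value,))
    visited = 0
    for key in index[start:]:
        if key[0] != value:
            break
        visited += 1
    return visited

def scan_trailing(index, value):
    """Entries read when the predicate is only on the second column.

    The sort order does not help, so every entry is checked.
    Returns (entries visited, entries matched).
    """
    visited = matched = 0
    for key in index:
        visited += 1
        if key[1] == value:
            matched += 1
    return visited, matched

print(seek_leading(idx_12, 7))    # where col1 = 7, using (col1, col2)
print(scan_trailing(idx_12, 7))   # where col2 = 7, using (col1, col2)
print(seek_leading(idx_21, 7))    # where col2 = 7, using (col2, col1)
```

In this model the same 100 matching entries cost 100 reads when the predicate column leads the index and 10,000 reads when it does not, which is the whole argument for letting your queries drive the column order.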

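    In conclusion, the key considerations for the column ordering of a composite index are index key compression and how you are going to use the index in your queries.

    References:

  • Column order in Index
  • It's less efficient to have low cardinality leading columns in an index (right)?
  • Index Skip Scan – Does index column order matter any more? (Warning Sign)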


    Chr*_*xon 5

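    When choosing index column order, the overriding concern is:

    Are there (equality) predicates against this column in my queries?

    If a column never appears in a where clause, it's not worth indexing (1)

    OK, so you've got a table and queries against each column. Sometimes more than one.

    How do you decide what to index?

    Let's look at an example. Here's a table with three columns. One holds 10 values, another 1,000, and the last 10,000: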

    create table t(
      few_vals  varchar2(10),
      many_vals varchar2(10),
      lots_vals varchar2(10)
    );
    
    insert into t 
    with rws as (
      select lpad(mod(rownum, 10), 10, '0'), 
             lpad(mod(rownum, 1000), 10, '0'), 
             lpad(rownum, 10, '0') 
      from dual connect by level <= 10000
    )
      select * from rws;
    
    commit;
    
    select count(distinct few_vals),
           count(distinct many_vals) ,
           count(distinct lots_vals) 
    from   t;
    
    COUNT(DISTINCTFEW_VALS)  COUNT(DISTINCTMANY_VALS)  COUNT(DISTINCTLOTS_VALS)  
    10                       1,000                     10,000     
    

    These are numbers left-padded with zeros. This will help make the point about compression later.

    So you've got three common queries:

    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  few_vals = '0000000001';
    
    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001';
    
    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001'
    and    few_vals = '0000000001';
    

    What do you index?

    An index on just few_vals is only marginally better than a full table scan (58 buffers versus 61):

    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  few_vals = '0000000001';
    
    select * 
    from table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    -------------------------------------------------------------------------------------------  
    | Id  | Operation            | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  
    -------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT     |          |      1 |        |      1 |00:00:00.01 |      61 |  
    |   1 |  SORT AGGREGATE      |          |      1 |      1 |      1 |00:00:00.01 |      61 |  
    |   2 |   VIEW               | VW_DAG_0 |      1 |   1000 |   1000 |00:00:00.01 |      61 |  
    |   3 |    HASH GROUP BY     |          |      1 |   1000 |   1000 |00:00:00.01 |      61 |  
    |   4 |     TABLE ACCESS FULL| T        |      1 |   1000 |   1000 |00:00:00.01 |      61 |  
    -------------------------------------------------------------------------------------------
    
    select /*+ index (t (few_vals)) */
           count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  few_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    -------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  
    -------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      58 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      58 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |   1000 |   1000 |00:00:00.01 |      58 |  
    |   3 |    HASH GROUP BY                       |          |      1 |   1000 |   1000 |00:00:00.01 |      58 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |   1000 |   1000 |00:00:00.01 |      58 |  
    |   5 |      INDEX RANGE SCAN                  | FEW      |      1 |   1000 |   1000 |00:00:00.01 |       5 |  
    -------------------------------------------------------------------------------------------------------------
    

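    So it's unlikely to be worth indexing on its own. Queries on lots_vals return few rows (just 1 in this case). So this is definitely worth indexing.

    But what about queries against both columns?

    Should you index: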

    ( few_vals, lots_vals )
    

    or

    ( lots_vals, few_vals )
    

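    Trick question!

    The answer is neither.

    Sure, few_vals is a long string, so you can get good compression out of it. And you (might) get an index skip scan on (few_vals, lots_vals) for queries that only have predicates on lots_vals. But I don't get one here without a hint, even though the skip scan performs notably better than the full scan (13 buffers versus 61):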

    create index few_lots on t(few_vals, lots_vals);
    
    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    -------------------------------------------------------------------------------------------  
    | Id  | Operation            | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  
    -------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT     |          |      1 |        |      1 |00:00:00.01 |      61 |  
    |   1 |  SORT AGGREGATE      |          |      1 |      1 |      1 |00:00:00.01 |      61 |  
    |   2 |   VIEW               | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |      61 |  
    |   3 |    HASH GROUP BY     |          |      1 |      1 |      1 |00:00:00.01 |      61 |  
    |   4 |     TABLE ACCESS FULL| T        |      1 |      1 |      1 |00:00:00.01 |      61 |  
    -------------------------------------------------------------------------------------------  
    
    select /*+ index_ss (t few_lots) */count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    ----------------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  
    ----------------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      13 |     11 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |  
    |   3 |    HASH GROUP BY                       |          |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |  
    |   5 |      INDEX SKIP SCAN                   | FEW_LOTS |      1 |     40 |      1 |00:00:00.01 |      12 |     11 |  
    ----------------------------------------------------------------------------------------------------------------------
    

    Do you like gambling? (2)
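One way to see why leaning on a skip scan is a gamble: a rough Python sketch of the idea, under the assumption that a skip scan behaves like one probe per distinct leading value. This is a toy model on sorted tuples, not Oracle's implementation, and `skip_scan`, `few_lots` and `many_lots` are names invented here. The data mirrors the demo table, with plain integers instead of zero-padded strings.

```python
from bisect import bisect_left

# Same shape as the demo table: (few, many, lots) with 10 / 1,000 / 10,000 values.
rows = [(r % 10, r % 1000, r) for r in range(1, 10001)]

few_lots = sorted((f, l) for f, m, l in rows)    # like (few_vals, lots_vals)
many_lots = sorted((m, l) for f, m, l in rows)   # like (many_vals, lots_vals)

def skip_scan(index, trailing_value):
    """Toy skip scan: probe once per distinct leading value.

    Assumes integer leading values so "next leading value" can be
    found by searching for (lead + 1,). Returns (probes, matches).
    """
    probes, matches = 0, []
    pos = 0
    while pos < len(index):
        lead = index[pos][0]
        probes += 1
        hit = bisect_left(index, (lead, trailing_value))
        while hit < len(index) and index[hit] == (lead, trailing_value):
            matches.append(index[hit])
            hit += 1
        # Jump past every remaining entry for this leading value.
        pos = bisect_left(index, (lead + 1,))
    return probes, matches

print(skip_scan(few_lots, 1))        # 10 probes to find one row
print(skip_scan(many_lots, 1)[0])    # 1,000 probes for the same row
```

The work grows with the number of distinct values in the leading column. If that number creeps up over time, the skip scan gets less attractive and the optimizer may stop choosing it.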

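    So you still need an index with lots_vals as the leading column. And, at least in this case, the composite index (few, lots) does the same amount of work as one on just (lots):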

    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001'
    and    few_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    -------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  
    -------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |       3 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |       3 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |       3 |  
    |   3 |    HASH GROUP BY                       |          |      1 |      1 |      1 |00:00:00.01 |       3 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      1 |      1 |00:00:00.01 |       3 |  
    |   5 |      INDEX RANGE SCAN                  | FEW_LOTS |      1 |      1 |      1 |00:00:00.01 |       2 |  
    -------------------------------------------------------------------------------------------------------------  
    
    create index lots on t(lots_vals);
    
    select /*+ index (t (lots_vals)) */count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  lots_vals = '0000000001'
    and    few_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    ----------------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  
    ----------------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |       3 |      1 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |  
    |   3 |    HASH GROUP BY                       |          |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |  
    |   5 |      INDEX RANGE SCAN                  | LOTS     |      1 |      1 |      1 |00:00:00.01 |       2 |      1 |  
    ----------------------------------------------------------------------------------------------------------------------  
    

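    There will be cases where the composite index saves you 1-2 IOs. But is it worth having two indexes for this saving?

    And there's another problem with the composite index. Compare the clustering factor of the three indexes that include LOTS_VALS: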

    create index lots on t(lots_vals);
    create index lots_few on t(lots_vals, few_vals);
    create index few_lots on t(few_vals, lots_vals);
    
    select index_name, leaf_blocks, distinct_keys, clustering_factor
    from   user_indexes
    where  table_name = 'T';
    
    INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR  
    FEW_LOTS    47           10,000         530                
    LOTS_FEW    47           10,000         53                 
    LOTS        31           10,000         53                 
    FEW         31           10             530    
    

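    Notice that the clustering factor for few_lots is 10x higher than for lots and lots_few! And this is in a demo table with perfect clustering to begin with. In real-world databases the effect is likely to be worse.

    So what's so bad about that?

    The clustering factor is one of the key drivers determining how "attractive" an index is. The higher it is, the less likely the optimizer is to choose it. Particularly if lots_vals isn't actually unique, but still normally has few rows per value. If you're unlucky this could be enough to make the optimizer think a full scan is cheaper...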
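If it helps, here is a rough Python sketch of where the 53 versus 530 comes from. It assumes the usual description of the clustering factor: walk the index in key order and count how often the table block changes from one entry to the next. `ROWS_PER_BLOCK = 189` is my assumption, picked only so that 10,000 rows land in 53 blocks like the demo; the function name is made up.

```python
ROWS_PER_BLOCK = 189  # assumption: chosen so 10,000 rows occupy 53 blocks

# (few, lots, block) for rows stored in insertion (rownum) order.
rows = [(r % 10, r, (r - 1) // ROWS_PER_BLOCK) for r in range(1, 10001)]

def clustering_factor(rows, key):
    """Count table-block changes while walking rows in index-key order."""
    changes, prev = 0, None
    for row in sorted(rows, key=key):
        block = row[2]
        if block != prev:
            changes += 1
            prev = block
    return changes

cf_lots = clustering_factor(rows, key=lambda r: (r[1],))           # (lots)
cf_lots_few = clustering_factor(rows, key=lambda r: (r[1], r[0]))  # (lots, few)
cf_few_lots = clustering_factor(rows, key=lambda r: (r[0], r[1]))  # (few, lots)

print(cf_lots, cf_lots_few, cf_few_lots)  # 53 53 530
```

With lots leading, index order matches table order, so each block is visited once. With few leading, the walk sweeps through the whole table once per few value, so every block is revisited 10 times.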

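    OK, so composite indexes with few_vals and lots_vals only have edge case benefits.

    What about queries filtering on few_vals and many_vals?

    Single-column indexes give only small benefits. But combined, the two columns return few rows. So a composite index is a good idea. But which way round?

    If you place few first, compressing the leading column will make the index smaller: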

    create index few_many on t(many_vals, few_vals);
    create index many_few on t(few_vals, many_vals);
        
    select index_name, leaf_blocks, distinct_keys, clustering_factor 
    from   user_indexes
    where  index_name in ('FEW_MANY', 'MANY_FEW');
    
    INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR  
    FEW_MANY    47           1,000          10,000             
    MANY_FEW    47           1,000          10,000   
        
    alter index few_many rebuild compress 1;
    alter index many_few rebuild compress 1;
        
    select index_name, leaf_blocks, distinct_keys, clustering_factor 
    from   user_indexes
    where  index_name in ('FEW_MANY', 'MANY_FEW');
    
    INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR  
    MANY_FEW    31           1,000          10,000             
    FEW_MANY    34           1,000          10,000      
    

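    With fewer different values in the leading column, the index compresses better. So there's marginally less work to read this index. But only slightly. And both are already a good chunk smaller than the original 47 leaf blocks (at least a 25% size decrease).

    And you can go further and compress the whole index!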

    alter index few_many rebuild compress 2;
    alter index many_few rebuild compress 2;
    
    select index_name, leaf_blocks, distinct_keys, clustering_factor 
    from   user_indexes
    where  index_name in ('FEW_MANY', 'MANY_FEW');
    
    INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR  
    FEW_MANY    20           1,000          10,000             
    MANY_FEW    20           1,000          10,000   
    

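    Now both indexes are back to the same size. Note this takes advantage of the fact that there's a relationship between few and many. Again, it's unlikely you'll see this kind of benefit in the real world.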
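A back-of-the-envelope Python sketch of why the column order matters for compress 1 but not for compress 2 on this data. It only counts distinct key prefixes. Real prefix compression stores prefixes per leaf block, so treat this as an illustration of the trend rather than a size calculation; `distinct_prefixes` is a made-up helper.

```python
# (few, many) pairs with the same relationship as the demo table:
# few is fully determined by many (few = many mod 10).
rows = [(r % 10, r % 1000) for r in range(1, 10001)]

def distinct_prefixes(rows, order, prefix_len):
    """Number of distinct key prefixes for a given column order.

    order is a tuple of positions into each row, e.g. (0, 1) for
    (few, many) and (1, 0) for (many, few).
    """
    keys = (tuple(row[i] for i in order) for row in rows)
    return len({key[:prefix_len] for key in keys})

for label, order in (("(few, many)", (0, 1)), ("(many, few)", (1, 0))):
    for n in (1, 2):
        d = distinct_prefixes(rows, order, n)
        print(label, "compress", n, "->", d, "prefixes,",
              len(rows) // d, "rows share each")
```

With a one-column prefix, (few, many) has 10 prefixes shared by 1,000 rows each, while (many, few) has 1,000 prefixes shared by 10 rows each, so the first deduplicates more. With a two-column prefix both orders have the same 1,000 distinct pairs, which only happens because few is derived from many.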

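    So far we've only talked about equality checks. Often with composite indexes you'll have an inequality against one of the columns, e.g. queries such as "get the orders/shipments/invoices for a customer in the past N days".

    If you have these kinds of queries, you want the equality against the first column of the index: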

    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  few_vals < '0000000002'
    and    many_vals = '0000000001';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    -------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  
    -------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      12 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      12 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |     10 |     10 |00:00:00.01 |      12 |  
    |   3 |    HASH GROUP BY                       |          |      1 |     10 |     10 |00:00:00.01 |      12 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |     10 |     10 |00:00:00.01 |      12 |  
    |   5 |      INDEX RANGE SCAN                  | FEW_MANY |      1 |     10 |     10 |00:00:00.01 |       2 |  
    -------------------------------------------------------------------------------------------------------------  
    
    select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
    from   t
    where  few_vals = '0000000001'
    and    many_vals < '0000000002';
    
    select * 
    from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));
    
    ----------------------------------------------------------------------------------------------------------------------  
    | Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  
    ----------------------------------------------------------------------------------------------------------------------  
    |   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      12 |      1 |  
    |   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      12 |      1 |  
    |   2 |   VIEW                                 | VW_DAG_0 |      1 |      2 |     10 |00:00:00.01 |      12 |      1 |  
    |   3 |    HASH GROUP BY                       |          |      1 |      2 |     10 |00:00:00.01 |      12 |      1 |  
    |   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      2 |     10 |00:00:00.01 |      12 |      1 |  
    |   5 |      INDEX RANGE SCAN                  | MANY_FEW |      1 |      1 |     10 |00:00:00.01 |       2 |      1 |  
    ----------------------------------------------------------------------------------------------------------------------  
    

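    Notice they're using opposite indexes: each query picks the index whose leading column carries the equality predicate. (Despite the names, few_many is on (many_vals, few_vals) and many_few is on (few_vals, many_vals).)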
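Here is a small Python sketch of why the equality column wants to lead. It is a toy range scan over sorted tuples, not what Oracle does internally, and the names `by_many_few`, `by_few_many` and `range_scan` are invented for the example. It models the first query above (few_vals < 2 and many_vals = 1) with integers instead of zero-padded strings.

```python
from bisect import bisect_left

# (few, many) with the same distribution as the demo table.
rows = [(r % 10, r % 1000) for r in range(1, 10001)]

by_many_few = sorted((m, f) for f, m in rows)   # equality column leads
by_few_many = sorted(rows)                      # range column leads

def range_scan(index, lo, hi, keep):
    """Read entries with lo <= key < hi, filtering with keep().

    Returns (entries visited, entries matched).
    """
    start, stop = bisect_left(index, lo), bisect_left(index, hi)
    visited = index[start:stop]
    return len(visited), sum(1 for key in visited if keep(key))

# Equality first: many = 1 and few < 2 is one contiguous slice.
eq_first = range_scan(by_many_few, (1,), (1, 2), keep=lambda k: True)

# Range first: every entry with few < 2 is read, then filtered on many = 1.
range_first = range_scan(by_few_many, (0,), (2,), keep=lambda k: k[1] == 1)

print(eq_first)      # (10, 10)
print(range_first)   # (2000, 10)
```

Both orders find the same 10 rows, but with the range column leading the scan reads 2,000 entries to get them. With the equality column leading, the matching entries sit next to each other and nothing is read in vain.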

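    TL;DR

    • Columns with equality conditions should go first in the index.
    • If you have multiple columns with equalities in your query, placing the one with the fewest different values first will give the best compression advantage
    • While index skip scans are possible, you need to be confident this will remain a viable option for the foreseeable future
    • Composite indexes that include near-unique columns give minimal benefits. Be sure you really need to save those 1-2 IOs

    1: In some cases it may be worth including a column in an index if this means all the columns in your query are in the index. This enables an index-only scan, so you don't need to access the table.

    2: If you're licensed for Diagnostics and Tuning, you could force the plan to a skip scan with SQL Plan Management

Addendum

    PS - the docs you've quoted there are from 9i. That's really old. I'd stick with something more recent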