Avoiding hash joins with IN clauses and subqueries in Spanner

Joe*_*ber 10 sql google-cloud-platform google-cloud-spanner

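I have the following query optimization problem in Spanner, and I'm hoping there's a trick I'm missing that will help me bend the query planner to my will.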

Here is the simplified schema:

create table T0 (
  key0  int64 not null,
  value int64,
  other int64 not null,
) primary key (key0);

create table T1 (
  key1  int64 not null,
  other int64 not null
) primary key (key1);

And a query with a subquery in an IN clause:

select value from T0 t0
where t0.other in (
  select t1.other from T1 t1 where t1.key1 in (42, 43, 44)  -- note: this subquery is a good deal more complex than this
)

This produces a 10-element result set via a hash join of T0 against the output of the subquery:

Operator                     Rows  Executions
-----------------------      ----- ----------
Serialize Result               10          1
Hash Join                      10          1
  Distributed union         10000          1
    Local distributed union 10000          1
    Table Scan: T0          10000          1
  Distributed cross apply:      5          1
   ...lots moar T1 subquery stuff...

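Note that, while the subquery is complex, it actually produces a very small set. Unfortunately, it also scans the entirety of T1 to feed the hash join, which is very slow.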

However, if I take the output of the subquery on T1 and manually shove it into an IN clause:

select value from T0
where other in (5, 6, 7, 8, 9)  -- presume this `IN` clause to be the output of the above subquery

It is dramatically faster, presumably because it just hits T0's index once per entry rather than running a hash join over the full contents:

Operator                Rows Executions
----------------------- ---- ----------
Distributed union         10          1
Local distributed union   10          1
Serialize Result          10          1
Filter                    10          1
Index Scan:               10          1

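I could simply run the two queries, which is my best plan so far. But I'm hoping I can find some way to get Spanner to decide that this is what it should do with the output of the subquery in the first example. I've tried everything I can think of, but this may simply not be expressible in SQL at all.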

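Also: I haven't quite proven this yet, but I fear that in some cases the 10-element subquery output could blow up to a few thousand elements (T1 will grow more or less without bound, easily to millions of rows). I've manually tested a few hundred elements in the splatted-out IN clause and it seems to perform acceptably, but I'm a bit concerned it could get out of hand.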
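For the two-query approach, the second query does not have to embed the values as literals. A minimal sketch, assuming the first query's output is bound by the client as an array query parameter; the name `@others` is hypothetical and does not appear in the original question. Whether the planner treats this the same way as the literal list has not been verified here.

```sql
-- @others is a hypothetical ARRAY<INT64> query parameter holding
-- the output of the T1 subquery, e.g. [5, 6, 7, 8, 9].
select value from T0
where other in unnest(@others)
```

This keeps the SQL text the same size no matter how many elements the subquery returns, so only the parameter grows.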

Note that I also tried a join on the subquery, like so:

select t0.other from T0 t0
join (
  -- Yes, this could be a simple join rather than a subquery, but in practice it's complex
  -- enough that it can't be expressed that way.
  select t1.other from T1 t1 where t1.key1 = 42
) sub on sub.other = t0.other

But it did something truly horrifying in the query planner, which I won't even try to explain here.

Mik*_*iss 2

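Does the actual subquery in the IN clause use any variables from T0? If not, what happens if you try the join query with the tables reordered (adding a DISTINCT for correctness, unless you know the values will already be distinct)?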

SELECT t0.other FROM  (
      -- Yes, this could be a simple join rather than a subquery, but in practice it's complex
      -- enough that it can't be expressed that way.
      SELECT DISTINCT t1.other FROM T1 t1 WHERE t1.key1 = 42
    ) sub 
JOIN T0 t0
ON sub.other = t0.other
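If reordering alone does not change the plan, Spanner's join hints may be worth trying. The following is only a sketch, not something tested against this schema: it assumes the `FORCE_JOIN_ORDER` and `JOIN_METHOD` hints behave as documented, and that an index on `T0.other` exists (the Index Scan in the question's second plan suggests one, though the simplified schema does not show it).

```sql
SELECT t0.value
FROM (
  -- Small driving side: the distinct values produced by the T1 subquery.
  SELECT DISTINCT t1.other FROM T1 t1 WHERE t1.key1 IN (42, 43, 44)
) sub
-- Assumption: these hints ask the planner to keep sub as the outer side
-- and to look up T0 once per sub row instead of building a hash join.
JOIN@{FORCE_JOIN_ORDER=TRUE, JOIN_METHOD=APPLY_JOIN} T0 t0
  ON sub.other = t0.other
```

Comparing the resulting plan with the hash-join plan in the question would show whether the hints took effect.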